Preface



This monograph deals with adaptive supervised classification, using tools bor- 
rowed from statistical mechanics and information theory, stemming from the PAC- 
Bayesian approach pioneered by David McAllester and applied to a conception of 
statistical learning theory forged by Vladimir Vapnik. Using convex analysis on the 
set of posterior probability measures, we show how to get local measures of the 
complexity of the classification model involving the relative entropy of posterior 
distributions with respect to Gibbs posterior measures. We then discuss relative 
bounds, comparing the generalization error of two classification rules, showing how 
the margin assumption of Mammen and Tsybakov can be replaced with some em- 
pirical measure of the covariance structure of the classification model. We show how 
to associate to any posterior distribution an effective temperature relating it to the 
Gibbs prior distribution with the same level of expected error rate, and how to esti- 
mate this effective temperature from data, resulting in an estimator whose expected 
error rate converges according to the best possible power of the sample size adap- 
tively under any margin and parametric complexity assumptions. We describe and 
study an alternative selection scheme based on relative bounds between estimators, 
and present a two step localization technique which can handle the selection of a 
parametric model from a family of those. We show how to extend systematically 
all the results obtained in the inductive setting to transductive learning, and use 
this to improve Vapnik's generalization bounds, extending them to the case when 
the sample is made of independent non-identically distributed pairs of patterns and 
labels. Finally we review briefly the construction of Support Vector Machines and 
show how to derive generalization bounds for them, measuring the complexity ei- 
ther through the number of support vectors or through the value of the transductive 
or inductive margin. 

Olivier Catoni 

CNRS - Laboratoire de Probabilités et Modèles Aléatoires, Université Paris 6
(site Chevaleret), 4 place Jussieu - Case 188, 75 252 Paris Cedex 05.
Introduction



Among the possible approaches to pattern recognition, statistical learning theory 
has received a lot of attention in the last few years. Although a realistic pattern 
recognition scheme involves data pre-processing and post-processing that need a 
theory of their own, a central role is often played by some kind of supervised learning 
algorithm. This central building block is the subject we are going to analyse in these 
notes. 

Accordingly, we assume that we have prepared in some way or another a sample
of $N$ labelled patterns $(X_i, Y_i)_{i=1}^N$, where $X_i$ ranges in some pattern
space $\mathcal{X}$ and $Y_i$ ranges in some finite label set $\mathcal{Y}$. We also
assume that we have devised our experiment in such a way that the couples of random
variables $(X_i, Y_i)$ are independent (but not necessarily equidistributed). Here,
randomness should be understood to come from the way the statistician has planned
his experiment. He may for instance have drawn the $X_i$s at random from some larger
population of patterns the algorithm is meant to be applied to in a second stage.
The labels $Y_i$ may have been set with the help of some external expertise (which
may itself be faulty or contain some amount of randomness, so we do not assume that
$Y_i$ is a function of $X_i$, and allow the couple of random variables $(X_i, Y_i)$
to follow any kind of joint distribution). In practice, patterns will be extracted
from some high dimensional and highly structured data, such as digital images,
speech signals, DNA sequences, etc. We will not discuss this pre-processing stage
here, although it poses crucial problems dealing with segmentation and the choice
of a representation. The aim of supervised classification is to choose some
classification rule $f : \mathcal{X} \to \mathcal{Y}$ which predicts $Y$ from $X$
making as few mistakes as possible on average.

The choice of $f$ will be driven by a suitable use of the information provided by
the sample $(X_i, Y_i)_{i=1}^N$ on the joint distribution of $X$ and $Y$. Moreover,
considering all the possible measurable functions $f$ from $\mathcal{X}$ to
$\mathcal{Y}$ would not be feasible in practice and, maybe more importantly, not
well founded from a statistical point of view, at least as soon as the pattern
space $\mathcal{X}$ is large and little is known in advance about the joint
distribution of patterns $X$ and labels $Y$. Therefore, we will consider
parametrized subsets of classification rules
$\{f_\theta : \mathcal{X} \to \mathcal{Y};\ \theta \in \Theta_m\}$, $m \in M$, which
may be grouped to form a big parameter set $\Theta = \bigcup_{m \in M} \Theta_m$.

The subject of this monograph is to introduce to statistical learning theory, and 
more precisely to the theory of supervised classification, a number of technical tools 
akin to statistical mechanics and information theory, dealing with the concepts of 
entropy and temperature. A central task will in particular be to control the mutual 
information between an estimated parameter and the observed sample. The focus 
will not be directly on the description of the data to be classified, but on the de- 
scription of the classification rules. As we want to deal with high dimensional data, 
we will be bound to consider high dimensional sets of candidate classification rules, 
and will analyse them with tools very similar to those used in statistical mechanics 
to describe particle systems with many degrees of freedom. More specifically, the
sets of classification rules will be described by Gibbs measures defined on parameter 
sets and depending on the observed sample value. A Gibbs measure is the special 
kind of probability measure used in statistical mechanics to describe the state of a 
particle system driven by a given energy function at some given temperature. Here, 
Gibbs measures will emerge as minimizers of the average loss value under entropy 
(or mutual information) constraints. Entropy itself, more precisely the Kullback 
divergence function between probability measures, will emerge in conjunction with 
the use of exponential deviation inequalities: indeed, the log-Laplace transform may 
be seen as the Legendre transform of the Kullback divergence function, as will be
stated in Lemma 1.1.3.

To fix notation, let $(X_i, Y_i)_{i=1}^N$ be the canonical process on
$\Omega = (\mathcal{X} \times \mathcal{Y})^N$ (which means the coordinate process).
Let the pattern space be provided with a sigma-algebra $\mathcal{B}$ turning it into
a measurable space $(\mathcal{X}, \mathcal{B})$. On the finite label space
$\mathcal{Y}$, we will consider the trivial algebra $\mathcal{B}'$ made of all its
subsets. Let
$\mathcal{M}_+^1\bigl[(\mathcal{X} \times \mathcal{Y})^N, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\bigr]$
be our notation for the set of probability measures (i.e. of positive measures
of total mass equal to 1) on the measurable space
$\bigl[(\mathcal{X} \times \mathcal{Y})^N, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\bigr]$.
Once some probability distribution
$\mathbb{P} \in \mathcal{M}_+^1\bigl[(\mathcal{X} \times \mathcal{Y})^N, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\bigr]$
is chosen, it turns $(X_i, Y_i)_{i=1}^N$ into the canonical realization of a
stochastic process modelling the observed sample (also called the training set).
We will assume that $\mathbb{P} = \bigotimes_{i=1}^N P_i$, where for each
$i = 1, \dots, N$,
$P_i \in \mathcal{M}_+^1(\mathcal{X} \times \mathcal{Y}, \mathcal{B} \otimes \mathcal{B}')$,
to reflect the assumption that we observe independent pairs of patterns and labels.
We will also assume that we are provided with some indexed set of possible
classification rules

$$\mathcal{R}_\Theta = \bigl\{ f_\theta : \mathcal{X} \to \mathcal{Y};\ \theta \in \Theta \bigr\},$$

where $(\Theta, \mathcal{T})$ is some measurable index set. Assuming some indexation
of the classification rules is just a matter of presentation. Although it leads to
heavier notation, it allows us to integrate over the space of classification rules
as well as over $\Omega$, using the usual formalism of multiple integrals. To this
end, we will assume that
$(\theta, x) \mapsto f_\theta(x) : (\Theta \times \mathcal{X}, \mathcal{B} \otimes \mathcal{T}) \to (\mathcal{Y}, \mathcal{B}')$
is a measurable function.

In many cases, as already mentioned, $\Theta = \bigcup_{m \in M} \Theta_m$ will be a
finite (or more generally countable) union of subspaces, dividing the classification
model $\mathcal{R}_\Theta = \bigcup_{m \in M} \mathcal{R}_{\Theta_m}$ into a union of
sub-models. The importance of introducing such a
structure has been put forward by V. Vapnik, as a way to avoid making strong 
hypotheses on the distribution P of the sample. If neither the distribution of the 
sample nor the set of classification rules were constrained, it is well known that no 
kind of statistical inference would be possible. Considering a family of sub-models is 
a way to provide for adaptive classification where the choice of the model depends on 
the observed sample. Restricting the set of classification rules is more realistic than 
restricting the distribution of patterns, since the classification rules are a processing 
tool left to the choice of the statistician, whereas the distribution of the patterns 
is not fully under his control, except for some planning of the learning experiment 
which may enforce some weak properties like independence, but not the precise 
shapes of the marginal distributions Pi which are as a rule unknown distributions 
on some high dimensional space. 

In these notes, we will concentrate on general issues concerned with a natural
measure of risk, namely the expected error rate of each classification rule
$f_\theta$, expressed as

$$R(\theta) = \frac{1}{N} \sum_{i=1}^N \mathbb{P}\bigl[ f_\theta(X_i) \neq Y_i \bigr]. \tag{0.1}$$
As this quantity is unobserved, we will be led to work with the corresponding
empirical error rate

$$r(\theta, \omega) = \frac{1}{N} \sum_{i=1}^N \mathbb{1}\bigl[ f_\theta(X_i) \neq Y_i \bigr]. \tag{0.2}$$


This does not mean that practical learning algorithms will always try to minimize
this criterion. On the contrary, they often try to minimize some other criterion
which is linked with the structure of the problem and has some nice additional
properties (like smoothness and convexity, for example). Nevertheless, and
independently of the precise form of the estimator
$\hat\theta : \Omega \to \Theta$ under study, the analysis of $R(\hat\theta)$ is a
natural question, and often corresponds to what is required in practice.

Answering this question is not straightforward because, although $R(\theta)$ is the
expectation of $r(\theta)$, a sum of independent Bernoulli random variables,
$R(\hat\theta)$ is not the expectation of $r(\hat\theta)$, because of the dependence
of $\hat\theta$ on the sample, and neither is $r(\hat\theta)$ a sum of independent
random variables. To circumvent this unfortunate situation, some uniform control
over the deviations of $r$ from $R$ is needed.
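
The optimism of $r(\hat\theta)$ can be seen on a toy case. The sketch below is only an illustration under assumptions that are not in the text: patterns uniform on $[0,1]$, labels that are fair coin flips independent of the patterns (so that $R(\theta) = 1/2$ for every rule), and a hypothetical family of threshold rules with both orientations. Minimizing the empirical error over this family then never yields more than $1/2$, and on average yields strictly less.

```python
import random


def empirical_error(threshold, sign, xs, ys):
    """r(theta) for the rule f(x) = sign if x > threshold else 1 - sign."""
    errors = 0
    for x, y in zip(xs, ys):
        prediction = sign if x > threshold else 1 - sign
        errors += prediction != y
    return errors / len(xs)


def erm_error(xs, ys, thresholds):
    """Empirical error of the empirical risk minimizer, min over theta of r(theta)."""
    return min(
        empirical_error(t, s, xs, ys) for t in thresholds for s in (0, 1)
    )


rng = random.Random(0)          # fixed seed, for reproducibility only
N = 50                          # sample size (illustrative)
thresholds = [k / 20 for k in range(21)]

# Labels are independent of the patterns: R(theta) = 1/2 for every theta.
erm_values = []
for _ in range(200):
    xs = [rng.random() for _ in range(N)]
    ys = [rng.randint(0, 1) for _ in range(N)]
    erm_values.append(erm_error(xs, ys, thresholds))

mean_erm = sum(erm_values) / len(erm_values)
print("average r(theta_hat):", mean_erm, "while R(theta_hat) = 0.5")
```

The gap between the printed average and $1/2$ is what a uniform control of $r - R$ has to account for.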

We will follow the PAC-Bayesian approach to this problem, originated in the
machine learning community and pioneered by McAllester (1998, 1999). It can be
seen as some variant of the more classical approach of M-estimators relying on
empirical process theory, as described for instance in Van de Geer (2000).
It is built on some general principles:

• One idea is to embed the set of estimators of the type
$\hat\theta : \Omega \to \Theta$ into the larger set of regular conditional
probability measures
$\rho : \bigl(\Omega, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\bigr) \to \mathcal{M}_+^1(\Theta, \mathcal{T})$.
We will call these conditional probability measures posterior distributions,
to follow standard terminology.

• A second idea is to measure the fluctuations of $\rho$ with respect to the sample,
using some prior distribution $\pi \in \mathcal{M}_+^1(\Theta, \mathcal{T})$, and
the Kullback divergence function $\mathcal{K}(\rho, \pi)$. The expectation
$\mathbb{P}\{\mathcal{K}(\rho, \pi)\}$ measures the randomness of $\rho$. The
optimal choice of $\pi$ would be $\mathbb{P}(\rho)$, resulting in a measure of the
randomness of $\rho$ equal to the mutual information between the sample and the
estimated parameter drawn from $\rho$. Anyhow, since $\mathbb{P}(\rho)$ is usually
not better known than $\mathbb{P}$, we will have to be content with some less
concentrated prior distribution $\pi$, resulting in some looser measure of
randomness, as shown by the identity
$\mathbb{P}[\mathcal{K}(\rho, \pi)] = \mathbb{P}\{\mathcal{K}[\rho, \mathbb{P}(\rho)]\} + \mathcal{K}[\mathbb{P}(\rho), \pi]$.

• A third idea is to analyse the fluctuations of the random process
$\theta \mapsto r(\theta)$ from its mean process $\theta \mapsto R(\theta)$ through
the log-Laplace transform

$$-\frac{1}{\lambda} \log \left\{ \iint \exp\bigl[-\lambda r(\theta, \omega)\bigr] \pi(d\theta) \mathbb{P}(d\omega) \right\},$$

as would be done in statistical mechanics, where this is called the free energy.
This transform is well suited to relate $\min_{\theta \in \Theta} r(\theta)$ to
$\inf_{\theta \in \Theta} R(\theta)$, since for large enough values of the
parameter $\lambda$, corresponding to low enough values of the temperature, the
system has small fluctuations around its ground state.

• A fourth idea deals with localization. It consists of considering a prior
distribution $\pi$ depending on the unknown expected error rate function $R$. Thus
some central result of the theory will consist in an empirical upper bound for
$\mathcal{K}\bigl[\rho, \pi_{\exp(-\beta R)}\bigr]$, where $\pi_{\exp(-\beta R)}$,
defined by its density

$$\frac{d\pi_{\exp(-\beta R)}}{d\pi} = \frac{\exp(-\beta R)}{\pi[\exp(-\beta R)]},$$

is a Gibbs distribution built from a known prior distribution
$\pi \in \mathcal{M}_+^1(\Theta, \mathcal{T})$, some inverse temperature parameter
$\beta \in \mathbb{R}_+$ and the expected error rate $R$. This bound will in
particular be used when $\rho$ is a posterior Gibbs distribution, of the form
$\pi_{\exp(-\beta r)}$. The general idea will be to show that in the case when
$\rho$ is not too random, in the sense that it is possible to find a prior (that
is non-random) distribution $\pi$ such that $\mathcal{K}(\rho, \pi)$ is small, then
$\rho(r)$ can be reliably taken for a good approximation of $\rho(R)$.
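
To make the Gibbs posterior $\pi_{\exp(-\beta r)}$ concrete, here is a minimal sketch on a finite parameter set. The five error rates, the uniform prior and the values of $\beta$ are made up for illustration, and the helper names `gibbs_measure` and `kl` are hypothetical. The sketch shows how raising the inverse temperature lowers the average empirical error $\rho(r)$ while raising the divergence $\mathcal{K}(\rho, \pi)$.

```python
import math


def gibbs_measure(prior, energies, beta):
    """Weights proportional to prior[i] * exp(-beta * energies[i])."""
    shift = min(energies)  # subtracting a constant leaves the measure unchanged
    weights = [p * math.exp(-beta * (e - shift)) for p, e in zip(prior, energies)]
    total = sum(weights)
    return [w / total for w in weights]


def kl(rho, pi):
    """Kullback divergence K(rho, pi) between two finite distributions."""
    return sum(r * math.log(r / p) for r, p in zip(rho, pi) if r > 0)


# Hypothetical empirical error rates r(theta) on a five-point parameter set.
r = [0.40, 0.25, 0.20, 0.22, 0.45]
prior = [0.2] * 5

summary = []
for beta in (0.0, 10.0, 100.0):
    rho = gibbs_measure(prior, r, beta)
    avg_error = sum(w * e for w, e in zip(rho, r))   # rho(r)
    summary.append((beta, avg_error, kl(rho, prior)))
    print(beta, avg_error, kl(rho, prior))
```

At $\beta = 0$ the posterior coincides with the prior; as $\beta$ grows it concentrates on the empirical minimizer, which is the trade-off between fit and randomness that the bounds quantify.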

This monograph is divided into four chapters. The first deals with the inductive 
setting presented in these lines. The second is devoted to relative bounds. It shows 
that it is possible to obtain a tighter estimate of the mutual information between 
the sample and the estimated parameter by comparing prior and posterior Gibbs 
distributions. It shows how to use this idea to obtain adaptive model selection 
schemes under very weak hypotheses. 



The third chapter introduces the transductive setting of V. Vapnik (Vapnik, 1998),
which consists in comparing the performance of classification rules on the
learning sample with their performance on a test sample instead of their average
performance. The fourth one is a fast introduction to Support Vector Machines.
It is the occasion to show the implications of the general results discussed in the
first three chapters when some particular choice is made about the structure of the
classification rules.

In the first chapter, two types of bounds are shown. Empirical bounds are useful
to build, compare and select estimators. Non-random bounds are useful to assess the
speed of convergence of estimators, relating this speed to the behaviour of the Gibbs
prior expected error rate $\beta \mapsto \pi_{\exp(-\beta R)}(R)$ and to covariance
factors related to the margin assumption of Mammen and Tsybakov when a finer
analysis is performed.
We will proceed from the most straightforward bounds towards more elaborate 
ones, built to achieve a better asymptotic behaviour. In this course towards more 
sophisticated inequalities, we will introduce local bounds and relative bounds. 

The study of relative bounds is expanded in the second chapter, where tighter
comparisons between prior and posterior Gibbs distributions are proved. Theorems
2.1.3 (page 54) and 2.2.4 (page 73) present two ways of selecting some nearly
optimal classification rule. They are both proved to be adaptive in all the
parameters under Mammen and Tsybakov margin assumptions and parametric complexity
assumptions. This is done in Corollary 2.1.17 (page 67) of Theorem 2.1.15 (page 55)
and in Theorem 2.2.11 (page 55). In the first approach, the performance of a
randomized estimator modelled by a posterior distribution is compared with the
performance of a prior Gibbs distribution. In the second approach, posterior
distributions are compared directly with one another (which leads to slightly
stronger results, at the price of using a more complex algorithm). When there is
more than one parametric model, it is appropriate to use also some doubly localized
scheme: two-step localization is presented for both approaches, in Theorems 2.3.2
and 2.3.9 (page 108), and provides bounds with a decreased influence of the number
of empirically inefficient models included in the selection scheme.

We would not like to lead the reader into thinking that the most sophisticated
results presented in these first two chapters are necessarily the most useful ones:
they are as a rule only more efficient asymptotically, whereas, being more involved,
they use looser constants leading to less precision for small sample sizes. In
practice, whether a sample is to be considered small is a question of the ratio
between the number of examples and the complexity (roughly speaking the number of
parameters) of the model used for classification. Since our aim here is to describe
methods appropriate for complex data (images, speech, DNA, ...), we suspect that
practitioners wanting to make use of our proposals will often be confronted with
small sample sizes; thus we would advise them to try the simplest bounds first and
only afterwards see whether the asymptotically better ones can bring some
improvement.

We would also like to point out that the results of the first two chapters are not
of a purely theoretical nature: posterior parameter distributions can indeed be
computed effectively, using Monte Carlo techniques, and there is well-established
know-how about these computations in Bayesian statistics. Moreover, non-randomized
estimators of the classical form $\hat\theta : \Omega \to \Theta$ can be efficiently
approximated by posterior distributions
$\rho : \Omega \to \mathcal{M}_+^1(\Theta)$ supported by a fairly narrow
neighbourhood of $\hat\theta$, more precisely a neighbourhood of the size of the
typical fluctuations of $\hat\theta$, so that this randomized approximation of
$\hat\theta$ will most of the time provide the same classification as $\hat\theta$
itself, except for a small amount of dubious examples for which the classification
provided by $\hat\theta$ would anyway be unreliable. This is explained on
page CD 

As already mentioned, the third chapter is about the transductive setting, that 
is about comparing the performance of estimators on a training set and on a test 
set. We show first that this comparison can be based on a set of exponential devi- 
ation inequalities which parallels the one used in the inductive case. This gives the 
opportunity to transport all the results obtained in the inductive case in a system- 
atic way. In the transductive setting, the use of prior distributions can be extended 
to the use of partially exchangeable posterior distributions depending on the union 
of training and test patterns, bringing increased possibilities to adapt to the data 
and giving rise to such crucial notions of complexity as the Vapnik-Červonenkis
dimension. 

Having done so, we more specifically focus on the small sample case, where local
and relative bounds are not expected to be of great help. Introducing a fictitious
(that is unobserved) shadow sample, we study Vapnik-type generalization bounds,
showing how to tighten and extend them with some original ideas, like making no
Gaussian approximation to the log-Laplace transform of Bernoulli random variables,
using a shadow sample of arbitrary size, shrinking from the use of any
symmetrization trick, and using a suitable subset of the group of permutations to
cover the case of independent non-identically distributed data. The culminating
result of the third chapter is Theorem 3.3.3 (page 125), subsequent bounds showing
the separate influence of the above ideas and providing an easier comparison with
Vapnik's original results. Vapnik-type generalization bounds have a broad
applicability, not only through the concept of Vapnik-Červonenkis dimension, but
also through the use of compression schemes (Little et al., 1986), which are
briefly described on
page Hill 

The beginning of the fourth chapter introduces Support Vector Machines, both
in the separable and in the non-separable case (using the box constraint). We then
describe different types of bounds. We start with compression scheme bounds, to
proceed with margin bounds. We begin with transductive margin bounds, recalling
on this occasion in Theorem 4.2.2 (page 144) the growth bound for a family of
classification rules with given Vapnik-Červonenkis dimension. In Theorem 4.2.4
(page 146) we give the usual estimate of the Vapnik-Červonenkis dimension of a
family of separating hyperplanes with a given transductive margin (we mean by this
that the margin is computed on the union of the training and test sets). We present
an original probabilistic proof inspired by a similar one from Cristianini et al.
(2000), whereas other proofs available usually rely on the informal claim that the
simplex is the worst case. We end this short review of Support Vector Machines with
a discussion of inductive margin bounds. Here the margin is computed on the
training set only, and a more involved combinatorial lemma, due to Alon et al.
(1997) and recalled in Lemma 4.2.6 (page 149), is used. We use this lemma and the
results of the third chapter to establish a bound depending on the margin of the
training set alone.

In the appendix, we finally discuss the textbook example of classification by
thresholding: in this setting, each classification rule is built by thresholding a
series of measurements and taking a decision based on these thresholded values.
This relatively simple example (which can be considered as an introduction to the
more technical case of classification trees) can be used to give more flesh to the
results of the first three chapters.

It is a pleasure to end this introduction with my greatest thanks to Anthony 
Davison, for his careful reading of the manuscript and his numerous suggestions. 
Chapter 1. Inductive PAC-Bayesian learning



The setting of inductive inference (as opposed to transductive inference to be dis- 
cussed later) is the one described in the introduction. 

When we have to take the expectation of a random variable
$Z : \Omega \to \mathbb{R}$ as well as of a function of the parameter
$h : \Theta \to \mathbb{R}$ with respect to some probability measure, we will as a
rule use short functional notation instead of resorting to the integral sign: thus
we will write $\mathbb{P}(Z)$ for $\int_\Omega Z(\omega) \mathbb{P}(d\omega)$ and
$\pi(h)$ for $\int_\Theta h(\theta) \pi(d\theta)$.

A more traditional statistical approach would focus on estimators
$\hat\theta : \Omega \to \Theta$ of the parameter $\theta$ and be interested in the
relationship between the empirical error rate $r(\hat\theta)$, defined by equation
(0.2, page ix), which is the proportion of errors made on the sample, and the
expected error rate $R(\hat\theta)$, defined by equation (0.1, page viii), which is
the expected probability of error on new instances of patterns. The PAC-Bayesian
approach instead chooses a broader perspective and allows the estimator
$\hat\theta$ to be drawn at random using some auxiliary source of randomness to
smooth the dependence of $\hat\theta$ on the sample. One way of representing the
supplementary randomness allowed in the choice of $\hat\theta$ is to consider what
it is usual to call posterior distributions on the parameter space, that is
probability measures $\rho : \Omega \to \mathcal{M}_+^1(\Theta, \mathcal{T})$,
depending on the sample, or from a technical perspective, regular conditional (or
transition) probability measures. Let us recall that we use the model described in
the introduction: the training sample is modelled by the canonical process
$(X_i, Y_i)_{i=1}^N$ on $\Omega = (\mathcal{X} \times \mathcal{Y})^N$, and a product
probability measure $\mathbb{P} = \bigotimes_{i=1}^N P_i$ on $\Omega$ is considered
to reflect the assumption that the training sample is made of independent pairs of
patterns and labels. The transition probability measure $\rho$, along with
$\mathbb{P} \in \mathcal{M}_+^1(\Omega)$, defines a probability distribution on
$\Omega \times \Theta$ and describes the conditional distribution of the estimated
parameter $\hat\theta$ knowing the sample $(X_i, Y_i)_{i=1}^N$.

The main subject of this broadened theory becomes to investigate the relationship
between $\rho(r)$, the average error rate of $\hat\theta$ on the training sample,
and $\rho(R)$, the expected error rate of $\hat\theta$ on new samples. The first
step towards using some kind of thermodynamics to tackle this question is to
consider the Laplace transform of $\rho(R) - \rho(r)$, a well known provider of
non-asymptotic deviation bounds. This transform takes the form

$$\mathbb{P}\Bigl\{ \exp\bigl[ \lambda \bigl[ \rho(R) - \rho(r) \bigr] \bigr] \Bigr\},$$

where some inverse temperature parameter $\lambda \in \mathbb{R}_+$, as a physicist
would call it, is introduced. This Laplace transform would be easy to bound if
$\rho$ did not depend on $\omega \in \Omega$ (namely on the sample), because
$\rho(R)$ would then be non-random, and

$$\rho(r) = \frac{1}{N} \sum_{i=1}^N \rho\bigl[ Y_i \neq f_{\cdot}(X_i) \bigr]$$


would be a sum of independent random variables. It turns out, and this will be
the subject of the next section, that this annoying dependence of $\rho$ on
$\omega$ can be quantified, using the inequality

$$\rho(R) - \rho(r) \leq \lambda^{-1} \log\bigl\{ \pi\bigl[ \exp[\lambda(R - r)] \bigr] \bigr\} + \lambda^{-1} \mathcal{K}(\rho, \pi),$$

which holds for any probability measure $\pi \in \mathcal{M}_+^1(\Theta)$ on the
parameter space; for our purpose it will be appropriate to consider a prior
distribution $\pi$ that is non-random, as opposed to $\rho$, which depends on the
sample. Here, $\mathcal{K}(\rho, \pi)$ is the Kullback divergence of $\rho$ from
$\pi$, whose definition will be recalled when we come to technicalities; it can be
seen as an upper bound for the mutual information between the sample
$(X_i, Y_i)_{i=1}^N$ and the estimated parameter $\hat\theta$. This inequality will
allow us to relate the penalized difference
$\rho(R) - \rho(r) - \lambda^{-1} \mathcal{K}(\rho, \pi)$ with the Laplace
transform of sums of independent random variables.
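
The claim that $\mathcal{K}(\rho, \pi)$ upper bounds the mutual information rests on the identity $\mathbb{P}[\mathcal{K}(\rho, \pi)] = \mathbb{P}\{\mathcal{K}[\rho, \mathbb{P}(\rho)]\} + \mathcal{K}[\mathbb{P}(\rho), \pi]$ quoted in the introduction. The sketch below checks it numerically in a deliberately tiny, made-up setting: three possible samples, three parameter values, and arbitrary posterior rows. All numbers and names are assumptions chosen for illustration.

```python
import math


def kl(rho, pi):
    """Kullback divergence K(rho, pi) between two finite distributions."""
    return sum(r * math.log(r / p) for r, p in zip(rho, pi) if r > 0)


# Toy model: P puts these masses on three possible samples omega.
sample_probs = [0.5, 0.3, 0.2]
# posteriors[k] is rho(omega_k, .), a distribution over three parameter values.
posteriors = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
prior = [1 / 3, 1 / 3, 1 / 3]

# P(rho): the posterior averaged over the sample distribution.
mean_posterior = [
    sum(p_w * rho[j] for p_w, rho in zip(sample_probs, posteriors))
    for j in range(3)
]

expected_kl = sum(p_w * kl(rho, prior) for p_w, rho in zip(sample_probs, posteriors))
mutual_information = sum(
    p_w * kl(rho, mean_posterior) for p_w, rho in zip(sample_probs, posteriors)
)
prior_gap = kl(mean_posterior, prior)

print(expected_kl, mutual_information + prior_gap)
```

The second term `prior_gap` is non-negative and vanishes exactly when the prior equals $\mathbb{P}(\rho)$, which is why that choice of prior would be optimal if it were available.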



1.1. Basic inequality 



Let us now come to the details of the investigation sketched above. The first thing
we will do is to study the Laplace transform of $R(\theta) - r(\theta)$, as a
starting point for the more general study of $\rho(R) - \rho(r)$: it corresponds to
the simple case where $\hat\theta$ is not random at all, and therefore where $\rho$
is a Dirac mass at some deterministic parameter value $\theta$.

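In the setting described in the introduction, let us consider the Bernoulli random
variables $\sigma_i(\theta) = \mathbb{1}\bigl[Y_i \neq f_\theta(X_i)\bigr]$, which
indicate whether the classification rule $f_\theta$ made an error on the $i$th
component of the training sample. Using independence and the concavity of the
logarithm function, it is readily seen that for any real constant $\lambda$

$$\log\Bigl\{ \mathbb{P}\bigl\{ \exp[-\lambda r(\theta)] \bigr\} \Bigr\}
= \sum_{i=1}^N \log\Bigl\{ \mathbb{P}\Bigl[ \exp\Bigl( -\tfrac{\lambda}{N} \sigma_i \Bigr) \Bigr] \Bigr\}
\leq N \log\Bigl\{ \frac{1}{N} \sum_{i=1}^N \mathbb{P}\Bigl[ \exp\Bigl( -\tfrac{\lambda}{N} \sigma_i \Bigr) \Bigr] \Bigr\}.$$
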

The right-hand side of this inequality is the log-Laplace transform of a Bernoulli
distribution with parameter
$\frac{1}{N} \sum_{i=1}^N \mathbb{P}(\sigma_i) = R(\theta)$. As any Bernoulli
distribution is fully defined by its parameter, this log-Laplace transform is
necessarily a function of $R(\theta)$. It can be expressed with the help of the
family of functions

$$\Phi_a(p) = -a^{-1} \log\bigl\{ 1 - [1 - \exp(-a)] p \bigr\}, \qquad a \in \mathbb{R},\ p \in (0, 1). \tag{1.1}$$



It is immediately seen that $\Phi_a$ is an increasing one-to-one mapping of the unit
interval onto itself, and that it is convex when $a > 0$, concave when $a < 0$ and
can be defined by continuity to be the identity when $a = 0$. Moreover the inverse
of $\Phi_a$ is given by the formula

$$\Phi_a^{-1}(q) = \frac{1 - \exp(-aq)}{1 - \exp(-a)}, \qquad a \in \mathbb{R},\ q \in (0, 1).$$

This formula may be used to extend $\Phi_a^{-1}$ to $q \in \mathbb{R}$, and we will
use this extension without further notice when required.
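
As a quick sanity check of these properties, the following sketch implements $\Phi_a$ and $\Phi_a^{-1}$ directly from the two formulas above. The function names `phi` and `phi_inv` are ours; `math.expm1` and `math.log1p` are used only for numerical accuracy when $a$ is small.

```python
import math


def phi(a, p):
    """Phi_a(p) = -(1/a) * log(1 - (1 - exp(-a)) * p), identity when a == 0."""
    if a == 0:
        return p
    # 1 - (1 - exp(-a)) * p  ==  1 + expm1(-a) * p
    return -math.log1p(math.expm1(-a) * p) / a


def phi_inv(a, q):
    """Phi_a^{-1}(q) = (1 - exp(-a q)) / (1 - exp(-a)), identity when a == 0."""
    if a == 0:
        return q
    return math.expm1(-a * q) / math.expm1(-a)


for a in (-2.0, 0.5, 3.0):
    for p in (0.1, 0.5, 0.9):
        q = phi(a, p)
        # q lies below p when a > 0 (convex case), above p when a < 0 (concave case)
        print(a, p, q, phi_inv(a, q))
```

Since $\Phi_a(0) = 0$ and $\Phi_a(1) = 1$, convexity for $a > 0$ forces $\Phi_a(p) \leq p$, which is the direction in which the function is used in the bounds.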

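Using this notation, the previous inequality becomes

$$\log\Bigl\{ \mathbb{P}\bigl\{ \exp[-\lambda r(\theta)] \bigr\} \Bigr\} \leq -\lambda \Phi_{\frac{\lambda}{N}}\bigl[ R(\theta) \bigr],$$

proving

Lemma 1.1.1. For any real constant $\lambda$ and any parameter $\theta \in \Theta$,

$$\mathbb{P}\Bigl\{ \exp\Bigl[ \lambda \Bigl[ \Phi_{\frac{\lambda}{N}}\bigl[ R(\theta) \bigr] - r(\theta) \Bigr] \Bigr] \Bigr\} \leq 1.$$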

In previous versions of this study, we had used some Bernstein bound, instead 
of this lemma. Anyhow, as it will turn out, keeping the log-Laplace transform of a 
Bernoulli instead of approximating it provides simpler and tighter results. 
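
The inequality behind Lemma 1.1.1 can be checked exactly when the error probabilities are known. In the sketch below, the list `probs` of individual error probabilities $\mathbb{P}(\sigma_i)$ is a made-up example of independent, non-identically distributed errors; the left-hand side is computed in closed form from independence, and the right-hand side is $-\lambda \Phi_{\lambda/N}[R(\theta)]$.

```python
import math


def phi(a, p):
    """Phi_a(p) from equation (1.1), identity when a == 0."""
    if a == 0:
        return p
    return -math.log1p(math.expm1(-a) * p) / a


def log_laplace_exact(lam, probs):
    """log P{exp[-lam * r]} when r is the mean of independent Bernoulli(p_i)."""
    n = len(probs)
    # P[exp(-(lam/n) * sigma_i)] = 1 + p_i * (exp(-lam/n) - 1)
    return sum(math.log1p(p * math.expm1(-lam / n)) for p in probs)


probs = [0.05, 0.10, 0.30, 0.50, 0.15, 0.25]   # hypothetical P(sigma_i)
n = len(probs)
R = sum(probs) / n                              # expected error rate R(theta)

checks = []
for lam in (-5.0, 0.5, 3.0, 20.0):
    lhs = log_laplace_exact(lam, probs)
    rhs = -lam * phi(lam / n, R)
    checks.append((lam, lhs, rhs))
    print(lam, lhs, rhs)
```

When all the $p_i$ are equal the two sides coincide, which is why no precision is lost in the identically distributed case.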

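Lemma 1.1.1 implies that for any constants $\lambda \in \mathbb{R}_+$ and
$\epsilon \in (0, 1)$,

$$\mathbb{P}\Bigl[ \Phi_{\frac{\lambda}{N}}\bigl[ R(\theta) \bigr] + \frac{\log(\epsilon)}{\lambda} \leq r(\theta) \Bigr] \geq 1 - \epsilon.$$

Choosing
$\lambda \in \arg\max_{\mathbb{R}_+} \Phi_{\frac{\lambda}{N}}\bigl[ R(\theta) \bigr] + \frac{\log(\epsilon)}{\lambda}$,
we deduce

Lemma 1.1.2. For any $\epsilon \in (0, 1)$, any $\theta \in \Theta$,

$$\mathbb{P}\Bigl\{ R(\theta) \leq \inf_{\lambda \in \mathbb{R}_+} \Phi_{\frac{\lambda}{N}}^{-1}\Bigl[ r(\theta) - \frac{\log(\epsilon)}{\lambda} \Bigr] \Bigr\} \geq 1 - \epsilon.$$
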



We will illustrate throughout these notes the bounds we prove with a small
numerical example: in the case where $N = 1000$, $\epsilon = 0.01$ and
$r(\theta) = 0.2$, we get with a confidence level of 0.99 that
$R(\theta) \leq 0.2402$, this being obtained for $\lambda = 234$.
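
The numerical example can be reproduced with a few lines of code. This is a minimal sketch, not the author's computation: it evaluates the bound of Lemma 1.1.2 on an integer grid of values of $\lambda$ (the grid is our choice) and keeps the smallest value. The names `deviation_bound` and `best_bound` are hypothetical.

```python
import math


def phi_inv(a, q):
    """Phi_a^{-1}(q) = (1 - exp(-a q)) / (1 - exp(-a)), for a != 0."""
    return math.expm1(-a * q) / math.expm1(-a)


def deviation_bound(r_emp, n, eps, lam):
    """Phi_{lam/n}^{-1}[ r - log(eps)/lam ], the bound of Lemma 1.1.2 at a fixed lam."""
    return phi_inv(lam / n, r_emp - math.log(eps) / lam)


def best_bound(r_emp, n, eps, lam_grid):
    """Smallest bound over the grid, returned with the lam that achieves it."""
    return min((deviation_bound(r_emp, n, eps, lam), lam) for lam in lam_grid)


bound, lam = best_bound(0.2, 1000, 0.01, range(1, 2001))
print(round(bound, 4), lam)
```

Because the bound is stated with an infimum over $\lambda$ inside the probability, and $r(\theta)$ is random, the value of $\lambda$ used in the lemma is in principle fixed from $R(\theta)$ rather than tuned on the data; the grid search here only illustrates the order of magnitude.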

Now, to proceed towards the analysis of posterior distributions, let us put
$U_\lambda(\theta, \omega) = \lambda \bigl\{ \Phi_{\frac{\lambda}{N}}\bigl[ R(\theta) \bigr] - r(\theta, \omega) \bigr\}$
for short, and let us consider some prior probability distribution
$\pi \in \mathcal{M}_+^1(\Theta)$. A proper choice of $\pi$ will be an important
question, underlying much of the material presented in this monograph, so for the
time being, let us only say that we will let this choice be as open as possible by
writing inequalities which hold for any choice of $\pi$. Let us insist on the fact
that when we say that $\pi$ is a prior distribution, we mean that it does not depend
on the training sample $(X_i, Y_i)_{i=1}^N$. The quantity of interest to obtain the
bound we are looking for is
$\log\bigl\{ \mathbb{P}\bigl[ \pi[\exp(U_\lambda)] \bigr] \bigr\}$. Using Fubini's
theorem for non-negative functions, we see that

$$\log\Bigl\{ \mathbb{P}\bigl[ \pi[\exp(U_\lambda)] \bigr] \Bigr\} = \log\Bigl\{ \pi\bigl[ \mathbb{P}[\exp(U_\lambda)] \bigr] \Bigr\} \leq 0.$$




To relate this quantity to the expectation $\rho(U_\lambda)$ with respect to any
posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, we will use the
properties of the Kullback divergence $\mathcal{K}(\rho, \pi)$ of $\rho$ with
respect to $\pi$, which is defined as

$$\mathcal{K}(\rho, \pi) = \begin{cases}
\displaystyle\int \log\Bigl( \frac{d\rho}{d\pi} \Bigr) d\rho, & \text{when } \rho \text{ is absolutely continuous with respect to } \pi, \\
\infty, & \text{otherwise.}
\end{cases}$$




The following lemma shows in which sense the Kullback divergence function can be
thought of as the dual of the log-Laplace transform.

Lemma 1.1.3. For any bounded measurable function $h : \Theta \to \mathbb{R}$ and
any probability distribution $\rho \in \mathcal{M}_+^1(\Theta)$ such that
$\mathcal{K}(\rho, \pi) < \infty$,

$$\log\bigl\{ \pi[\exp(h)] \bigr\} = \rho(h) - \mathcal{K}(\rho, \pi) + \mathcal{K}(\rho, \pi_{\exp(h)}),$$

where by definition
$\dfrac{d\pi_{\exp(h)}}{d\pi} = \dfrac{\exp[h(\theta)]}{\pi[\exp(h)]}$.
Consequently

$$\log\bigl\{ \pi[\exp(h)] \bigr\} = \sup_{\rho \in \mathcal{M}_+^1(\Theta)} \rho(h) - \mathcal{K}(\rho, \pi).$$


The proof is just a matter of writing down the definition of the quantities involved
and using the fact that the Kullback divergence function is non-negative, and can
be found in Catoni (2004, page 160). In the duality between measurable functions
and probability measures, we thus see that the log-Laplace transform with respect
to $\pi$ is the Legendre transform of the Kullback divergence function with respect
to $\pi$. Using this, we get

$$\mathbb{P}\Bigl\{ \exp\Bigl[ \sup_{\rho \in \mathcal{M}_+^1(\Theta)} \rho\bigl[ U_\lambda(\theta) \bigr] - \mathcal{K}(\rho, \pi) \Bigr] \Bigr\} \leq 1,$$

which, combined with the convexity of $\lambda \Phi_{\frac{\lambda}{N}}$, proves the
basic inequality we were looking for.
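
The duality of Lemma 1.1.3 is easy to check on a finite parameter set, where every integral is a finite sum. In the sketch below the prior `pi`, the function `h` and the test distribution `rho` are arbitrary illustrative values, and `gibbs` is a hypothetical helper building $\pi_{\exp(h)}$.

```python
import math


def kl(rho, pi):
    """Kullback divergence K(rho, pi) between two finite distributions."""
    return sum(r * math.log(r / p) for r, p in zip(rho, pi) if r > 0)


def gibbs(pi, h):
    """pi_{exp(h)}: density exp(h) / pi[exp(h)] with respect to pi."""
    weights = [p * math.exp(v) for p, v in zip(pi, h)]
    total = sum(weights)
    return [w / total for w in weights]


def mean(rho, h):
    """rho(h), the expectation of h under rho."""
    return sum(r * v for r, v in zip(rho, h))


pi = [0.1, 0.2, 0.3, 0.4]      # prior on four parameter values
h = [0.5, -1.0, 2.0, 0.0]      # a bounded function of the parameter
rho = [0.25, 0.25, 0.25, 0.25] # an arbitrary distribution with K(rho, pi) finite

log_laplace = math.log(sum(p * math.exp(v) for p, v in zip(pi, h)))
pi_h = gibbs(pi, h)

identity_rhs = mean(rho, h) - kl(rho, pi) + kl(rho, pi_h)
value_at_gibbs = mean(pi_h, h) - kl(pi_h, pi)   # attains the supremum
value_at_rho = mean(rho, h) - kl(rho, pi)       # any other rho does no better

print(log_laplace, identity_rhs, value_at_gibbs, value_at_rho)
```

The supremum is attained at the Gibbs measure $\pi_{\exp(h)}$ because the last term of the identity, $\mathcal{K}(\rho, \pi_{\exp(h)})$, is non-negative and vanishes only there.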



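Theorem 1.1.4. For any real constant $\lambda$,

$$\mathbb{P}\Bigl\{ \exp\Bigl[ \sup_{\rho \in \mathcal{M}_+^1(\Theta)} \lambda \Bigl[ \Phi_{\frac{\lambda}{N}}\bigl[ \rho(R) \bigr] - \rho(r) \Bigr] - \mathcal{K}(\rho, \pi) \Bigr] \Bigr\}
\leq \mathbb{P}\Bigl\{ \exp\Bigl[ \sup_{\rho \in \mathcal{M}_+^1(\Theta)} \lambda \Bigl[ \rho\bigl( \Phi_{\frac{\lambda}{N}} \circ R \bigr) - \rho(r) \Bigr] - \mathcal{K}(\rho, \pi) \Bigr] \Bigr\}
\leq 1.$$
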



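We insist on the fact that in this theorem, we take a supremum in
$\rho \in \mathcal{M}_+^1(\Theta)$ inside the expectation with respect to
$\mathbb{P}$, the sample distribution. This means that the proved inequality holds
for any $\rho$ depending on the training sample, that is for any posterior
distribution: indeed, measurability questions set aside,

$$\mathbb{P}\Bigl\{ \exp\Bigl[ \sup_{\rho \in \mathcal{M}_+^1(\Theta)} \rho\bigl[ U_\lambda(\theta) \bigr] - \mathcal{K}(\rho, \pi) \Bigr] \Bigr\}
= \sup_{\rho : \Omega \to \mathcal{M}_+^1(\Theta)} \mathbb{P}\Bigl\{ \exp\Bigl[ \rho\bigl[ U_\lambda(\theta) \bigr] - \mathcal{K}(\rho, \pi) \Bigr] \Bigr\},$$

and more formally,

$$\sup_{\rho : \Omega \to \mathcal{M}_+^1(\Theta)} \mathbb{P}\Bigl\{ \exp\Bigl[ \rho\bigl[ U_\lambda(\theta) \bigr] - \mathcal{K}(\rho, \pi) \Bigr] \Bigr\}
\leq \mathbb{P}\Bigl\{ \exp\Bigl[ \sup_{\rho \in \mathcal{M}_+^1(\Theta)} \rho\bigl[ U_\lambda(\theta) \bigr] - \mathcal{K}(\rho, \pi) \Bigr] \Bigr\},$$

where the supremum in $\rho$ taken in the left-hand side is restricted to regular
conditional probability distributions.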

The following sections will show how to use this theorem. 



1.2. Non local bounds 

At least three sorts of bounds can be deduced from Theorem 1.1.4.

The most interesting ones with which to build estimators and tune parameters, 
as well as the first that have been considered in the development of the PAC- 
Bayesian approach, are deviation bounds. They provide an empirical upper bound 
for p(R) — that is a bound which can be computed from observed data — with 
some probability 1 — e, where e is a presumably small and tunable parameter setting 
the desired confidence level. 

Anyhow, most of the results about the convergence speed of estimators to be
found in the statistical literature are concerned with the expectation
$\mathbb{P}[\rho(R)]$, therefore it is also enlightening to bound this quantity. In
order to know at which rate it may be approaching $\inf_\Theta R$, a non-random
upper bound is required, which will relate the average of the expected risk
$\mathbb{P}[\rho(R)]$ with the properties of the contrast function
$\theta \mapsto R(\theta)$.

Since the values of constants do matter a lot when a bound is to be used to select between various estimators using classification models of various complexities, a third kind of bound, related to the first, may be considered for the sake of its hopefully better constants: we will call them unbiased empirical bounds, to stress the fact that they provide some empirical quantity whose expectation under $\mathbb{P}$ can be proved to be an upper bound for $\mathbb{P}[\rho(R)]$, the average expected risk. The price to pay for these better constants is of course the lack of formal guarantee given by the bound: two random variables whose expectations are ordered in a certain way may very well be ordered in the reverse way with a large probability, so that basing the estimation of parameters or the selection of an estimator on some unbiased empirical bound is a hazardous business. Anyhow, since it is common practice to use the inequalities provided by mathematical statistical theory while replacing the proven constants with smaller values showing a better practical efficiency, considering unbiased empirical bounds as well as deviation bounds provides an indication about how much the constants may be decreased while not violating the theory too much.



1.2.1. Unbiased empirical bounds 

Let $\rho:\Omega\to\mathcal{M}_+^1(\Theta)$ be some fixed (and arbitrary) posterior distribution, describing some randomized estimator $\widehat{\theta}:\Omega\to\Theta$. As we already mentioned, in these notes a posterior distribution will always be a regular conditional probability measure. By this we mean that

• for any $A\in\mathcal{T}$, the map $\omega\mapsto\rho(\omega,A):\bigl(\Omega,(\mathcal{B}\otimes\mathcal{B}')^{\otimes N}\bigr)\to\mathbb{R}_+$ is assumed to be measurable;

• for any $\omega\in\Omega$, the map $A\mapsto\rho(\omega,A):\mathcal{T}\to\mathbb{R}_+$ is assumed to be a probability measure.

We will also assume without further notice that the $\sigma$-algebras we deal with are always countably generated. The technical implications of these assumptions are standard and discussed for instance in Catoni (2004, pages 50-54), where, among other things, a detailed proof of the decomposition of the Kullback-Leibler divergence is given.

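Let us restrict to the case when the constant $\lambda$ is positive. We get from Theorem 1.1.4 that

$$\exp\Bigl[\lambda\bigl\{\Phi_{\lambda/N}\bigl[\mathbb{P}[\rho(R)]\bigr]-\mathbb{P}[\rho(r)]\bigr\}-\mathbb{P}[\mathcal{K}(\rho,\pi)]\Bigr]\le 1,\tag{1.2}$$

where we have used the convexity of the exp function and of $\Phi_{\lambda/N}$. Since we have restricted our attention to positive values of the constant $\lambda$, equation (1.2) can also be written

$$\mathbb{P}[\rho(R)]\le\Phi_{\lambda/N}^{-1}\Bigl\{\mathbb{P}\bigl[\rho(r)+\lambda^{-1}\mathcal{K}(\rho,\pi)\bigr]\Bigr\},$$

leading to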



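Theorem 1.2.1. For any posterior distribution $\rho:\Omega\to\mathcal{M}_+^1(\Theta)$, for any positive parameter $\lambda$,

$$\mathbb{P}[\rho(R)]\le\frac{1-\exp\Bigl\{-N^{-1}\mathbb{P}\bigl[\lambda\rho(r)+\mathcal{K}(\rho,\pi)\bigr]\Bigr\}}{1-\exp\bigl(-\frac{\lambda}{N}\bigr)}\le\frac{\lambda}{N\bigl[1-\exp\bigl(-\frac{\lambda}{N}\bigr)\bigr]}\,\mathbb{P}\Bigl[\rho(r)+\frac{\mathcal{K}(\rho,\pi)}{\lambda}\Bigr].$$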



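The last inequality provides the unbiased empirical upper bound for $\rho(R)$ we were looking for, meaning that the expectation of

$$\frac{\lambda}{N\bigl[1-\exp\bigl(-\frac{\lambda}{N}\bigr)\bigr]}\Bigl[\rho(r)+\frac{\mathcal{K}(\rho,\pi)}{\lambda}\Bigr]$$

is larger than the expectation of $\rho(R)$. Let us notice that

$$1\le\frac{\lambda}{N\bigl[1-\exp\bigl(-\frac{\lambda}{N}\bigr)\bigr]}\le\Bigl[1-\frac{\lambda}{2N}\Bigr]^{-1}\qquad(\lambda<2N),$$

and therefore that this coefficient is close to $1$ when $\lambda$ is significantly smaller than $N$.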
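As a numerical illustration of Theorem 1.2.1, the following sketch evaluates both forms of the bound together with the multiplicative coefficient. It assumes the definition $\Phi_a(p)=-a^{-1}\log\{1-[1-\exp(-a)]p\}$ used in this chapter; the function names are hypothetical, and `emp_risk` and `kl` stand for $\mathbb{P}[\rho(r)]$ and $\mathbb{P}[\mathcal{K}(\rho,\pi)]$, which are assumed to be given rather than estimated.

```python
import math

def phi(a, p):
    # Phi_a(p) = -(1/a) * log(1 - [1 - exp(-a)] * p), for a > 0 and 0 <= p <= 1
    return -math.log(1.0 - (1.0 - math.exp(-a)) * p) / a

def phi_inv(a, q):
    # Inverse map: Phi_a^{-1}(q) = (1 - exp(-a * q)) / (1 - exp(-a))
    return (1.0 - math.exp(-a * q)) / (1.0 - math.exp(-a))

def coefficient(lam, n):
    # Multiplicative factor lambda / (N * [1 - exp(-lambda / N)])
    return lam / (n * (1.0 - math.exp(-lam / n)))

def unbiased_bounds(emp_risk, kl, lam, n):
    # Returns the first (tighter) and second (linear) bounds of Theorem 1.2.1
    q = emp_risk + kl / lam
    return phi_inv(lam / n, q), coefficient(lam, n) * q

# Illustrative values only (not taken from any data set)
tight, linear = unbiased_bounds(emp_risk=0.2, kl=10.0, lam=300.0, n=1000)
print(tight, linear, coefficient(300.0, 1000))
```

The two returned values are ordered as in the theorem, and the coefficient stays between $1$ and $[1-\lambda/(2N)]^{-1}$ for these inputs.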

If we are ready to believe in this bound (although this belief is not mathematically well founded, as we already mentioned), we can use it to optimize $\lambda$ and to choose $\rho$. While the optimal choice of $\rho$ when $\lambda$ is fixed is, according to Lemma 1.1.3, to take it equal to $\pi_{\exp(-\lambda r)}$, a Gibbs posterior distribution, as it is sometimes called, we may for computational reasons be more interested in choosing $\rho$ in some other class of posterior distributions.

For instance, our real interest may be to select some non-randomized estimator from a family $\widehat{\theta}_m:\Omega\to\Theta_m$, $m\in M$, of possible ones, where $\Theta_m$ are measurable subsets of $\Theta$ and where $M$ is an arbitrary (not necessarily countable) index set. We may for instance think of the case when $\widehat{\theta}_m\in\arg\min_{\Theta_m}r$. We may slightly randomize the estimators to start with, considering for any $\theta\in\Theta_m$ and any $m\in M$,

$$\Delta_m(\theta)=\bigl\{\theta'\in\Theta_m:\ f_{\theta'}(X_i)=f_\theta(X_i),\ i=1,\dots,N\bigr\},$$

and defining $\rho_m$ by the formula

$$\frac{d\rho_m}{d\pi}(\theta)=\frac{\mathbb{1}\bigl[\theta\in\Delta_m(\widehat{\theta}_m)\bigr]}{\pi\bigl[\Delta_m(\widehat{\theta}_m)\bigr]}.$$

Our posterior minimizes $\mathcal{K}(\rho,\pi)$ among those distributions whose support is restricted to the values of $\theta$ in $\Theta_m$ for which the classification rule $f_\theta$ is identical to the estimated one $f_{\widehat{\theta}_m}$ on the observed sample. Presumably, in many practical situations, $f_\theta(x)$ will be $\rho_m$ almost surely identical to $f_{\widehat{\theta}_m}(x)$ when $\theta$ is drawn from $\rho_m$, for the vast majority of the values of $x\in\mathcal{X}$ and all the sub-models $\Theta_m$ not plagued with too much overfitting (since this is by construction the case when $x\in\{X_i: i=1,\dots,N\}$). Therefore replacing $\widehat{\theta}_m$ with $\rho_m$ can be expected to be a minor change in many situations. This change by the way can be estimated in the (admittedly not so common) case when the distribution of the patterns $(X_i)_{i=1}^N$ is known. Indeed, introducing the pseudo distance

$$D(\theta,\theta')=\frac{1}{N}\sum_{i=1}^N\mathbb{P}\bigl[f_\theta(X_i)\ne f_{\theta'}(X_i)\bigr],\qquad\theta,\theta'\in\Theta,\tag{1.3}$$

one immediately sees that $R(\theta')\le R(\theta)+D(\theta,\theta')$, for any $\theta,\theta'\in\Theta$, and therefore that

$$R(\widehat{\theta}_m)\le\rho_m(R)+\rho_m\bigl[D(\cdot,\widehat{\theta}_m)\bigr].$$

Let us notice also that in the case where $\Theta_m\subset\mathbb{R}^{d_m}$, and $R$ happens to be convex on $\Delta_m(\widehat{\theta}_m)$, then $\rho_m(R)\ge R\bigl[\int\theta\,\rho_m(d\theta)\bigr]$, and we can replace $\widehat{\theta}_m$ with $\widetilde{\theta}_m=\int\theta\,\rho_m(d\theta)$, and obtain bounds for $R(\widetilde{\theta}_m)$. This is not a very heavy assumption about $R$, in the case where we consider $\widehat{\theta}_m\in\arg\min_{\Theta_m}r$. Indeed, $\widehat{\theta}_m$, and therefore $\Delta_m(\widehat{\theta}_m)$, will presumably be close to $\arg\min_{\Theta_m}R$, and requiring a function to be convex in the neighbourhood of its minima is not a very strong assumption.

Since $r(\widehat{\theta}_m)=\rho_m(r)$, and $\mathcal{K}(\rho_m,\pi)=-\log\bigl\{\pi[\Delta_m(\widehat{\theta}_m)]\bigr\}$, our unbiased empirical upper bound in this context reads as

$$\frac{\lambda}{N\bigl[1-\exp\bigl(-\frac{\lambda}{N}\bigr)\bigr]}\Bigl\{r(\widehat{\theta}_m)-\frac{\log\bigl\{\pi[\Delta_m(\widehat{\theta}_m)]\bigr\}}{\lambda}\Bigr\}.$$

Let us notice that we obtain a complexity factor $-\log\bigl\{\pi[\Delta_m(\widehat{\theta}_m)]\bigr\}$ which may be compared with the Vapnik-Cervonenkis dimension. Indeed, in the case of binary classification, when using a classification model with Vapnik-Cervonenkis dimension not greater than $h_m$, that is when any subset of $\mathcal{X}$ which can be split in any arbitrary way by some classification rule $f_\theta$ of the model $\Theta_m$ has at most $h_m$ points, then

$$\{\Delta_m(\theta):\theta\in\Theta_m\}$$

is a partition of $\Theta_m$ with at most $\bigl(\frac{eN}{h_m}\bigr)^{h_m}$ components: these facts, if not already familiar to the reader, will be proved in Theorems 4.2.2 and 4.2.3. Therefore

$$\inf_{\theta\in\Theta_m}-\log\bigl\{\pi[\Delta_m(\theta)]\bigr\}\le h_m\log\Bigl(\frac{eN}{h_m}\Bigr)-\log[\pi(\Theta_m)].$$

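Thus, if the model and prior distribution are well suited to the classification task, in the sense that there is more "room" (where room is measured with $\pi$) between the two clusters defined by $\widehat{\theta}_m$ than between other partitions of the sample of patterns $(X_i)_{i=1}^N$, then we will have

$$-\log\bigl\{\pi[\Delta_m(\widehat{\theta}_m)]\bigr\}\le h_m\log\Bigl(\frac{eN}{h_m}\Bigr)-\log[\pi(\Theta_m)].$$

An optimal value $\widehat{m}$ may be selected so that

$$\widehat{m}\in\arg\min_{m\in M}\ \inf_{\lambda\in\mathbb{R}_+}\frac{\lambda}{N\bigl[1-\exp\bigl(-\frac{\lambda}{N}\bigr)\bigr]}\Bigl(r(\widehat{\theta}_m)-\frac{\log\bigl\{\pi[\Delta_m(\widehat{\theta}_m)]\bigr\}}{\lambda}\Bigr).$$

Since $\rho_{\widehat{m}}$ is still another posterior distribution, we can be sure that

$$\mathbb{P}\Bigl\{R(\widehat{\theta}_{\widehat{m}})-\rho_{\widehat{m}}\bigl[D(\cdot,\widehat{\theta}_{\widehat{m}})\bigr]\Bigr\}\le\mathbb{P}\bigl[\rho_{\widehat{m}}(R)\bigr]\le\inf_{\lambda\in\mathbb{R}_+}\mathbb{P}\Bigl\{\frac{\lambda}{N\bigl[1-\exp\bigl(-\frac{\lambda}{N}\bigr)\bigr]}\Bigl(r(\widehat{\theta}_{\widehat{m}})-\frac{\log\bigl\{\pi[\Delta_{\widehat{m}}(\widehat{\theta}_{\widehat{m}})]\bigr\}}{\lambda}\Bigr)\Bigr\}.$$

Taking the infimum in $\lambda$ inside the expectation with respect to $\mathbb{P}$ would be possible at the price of some supplementary technicalities and a slight increase of the bound that we prefer to postpone to the discussion of deviation bounds, since they are the only ones to provide a rigorous mathematical foundation to the adaptive selection of estimators.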



1.2.2. Optimizing explicitly the exponential parameter $\lambda$

In this section we address some technical issues we think helpful to the understanding of Theorem 1.2.1: namely to investigate how the upper bound it provides could be optimized, or at least approximately optimized, in $\lambda$. It turns out that this can be done quite explicitly.

So we will consider in this discussion the posterior distribution $\rho:\Omega\to\mathcal{M}_+^1(\Theta)$ to be fixed, and our aim will be to eliminate the constant $\lambda$ from the bound by choosing its value in some nearly optimal way as a function of $\mathbb{P}[\rho(r)]$, the average of the empirical risk, and of $\mathbb{P}[\mathcal{K}(\rho,\pi)]$, which controls overfitting.

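Let the bound be written as

$$\varphi(\lambda)=\Bigl[1-\exp\Bigl(-\frac{\lambda}{N}\Bigr)\Bigr]^{-1}\Bigl\{1-\exp\Bigl[-\frac{\lambda}{N}\mathbb{P}[\rho(r)]-N^{-1}\mathbb{P}[\mathcal{K}(\rho,\pi)]\Bigr]\Bigr\}.$$

We see that

$$N\frac{\partial}{\partial\lambda}\log[\varphi(\lambda)]=\frac{\mathbb{P}[\rho(r)]}{\exp\Bigl[\frac{\lambda}{N}\mathbb{P}[\rho(r)]+N^{-1}\mathbb{P}[\mathcal{K}(\rho,\pi)]\Bigr]-1}-\frac{1}{\exp\bigl(\frac{\lambda}{N}\bigr)-1}.$$

Thus, the optimal value for $\lambda$ is such that

$$\Bigl[\exp\Bigl(\frac{\lambda}{N}\Bigr)-1\Bigr]\mathbb{P}[\rho(r)]=\exp\Bigl[\frac{\lambda}{N}\mathbb{P}[\rho(r)]+N^{-1}\mathbb{P}[\mathcal{K}(\rho,\pi)]\Bigr]-1.$$

Assuming that $1\gg\frac{\lambda}{N}\mathbb{P}[\rho(r)]\gg\frac{\mathbb{P}[\mathcal{K}(\rho,\pi)]}{N}$, and keeping only higher order terms, we are led to choose

$$\lambda=\sqrt{\frac{2N\,\mathbb{P}[\mathcal{K}(\rho,\pi)]}{\mathbb{P}[\rho(r)]\bigl\{1-\mathbb{P}[\rho(r)]\bigr\}}},$$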



obtaining

Theorem 1.2.2. For any posterior distribution $\rho:\Omega\to\mathcal{M}_+^1(\Theta)$,

$$\mathbb{P}[\rho(R)]\le\frac{1-\exp\Bigl\{-\sqrt{\frac{2\,\mathbb{P}[\mathcal{K}(\rho,\pi)]\,\mathbb{P}[\rho(r)]}{N\{1-\mathbb{P}[\rho(r)]\}}}-\frac{\mathbb{P}[\mathcal{K}(\rho,\pi)]}{N}\Bigr\}}{1-\exp\Bigl\{-\sqrt{\frac{2\,\mathbb{P}[\mathcal{K}(\rho,\pi)]}{N\,\mathbb{P}[\rho(r)]\{1-\mathbb{P}[\rho(r)]\}}}\Bigr\}}.$$

This result of course is not very useful in itself, since neither of the two quantities $\mathbb{P}[\rho(r)]$ and $\mathbb{P}[\mathcal{K}(\rho,\pi)]$ is easy to evaluate. Anyhow it gives a hint that replacing them boldly with $\rho(r)$ and $\mathcal{K}(\rho,\pi)$ could produce something close to a legitimate empirical upper bound for $\rho(R)$. We will see in the subsection about deviation bounds that this is indeed essentially true.
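To see how close the approximately optimal $\lambda$ of this section is to the true minimizer, one can minimize the bound of Theorem 1.2.1 over a grid of values of $\lambda$. The sketch below does this by brute force; the grid range $(0,5N]$, the step count and the function names are arbitrary choices made for illustration, and `emp_risk` and `kl` again stand for $\mathbb{P}[\rho(r)]$ and $\mathbb{P}[\mathcal{K}(\rho,\pi)]$.

```python
import math

def phi_bound(lam, emp_risk, kl, n):
    # The bound of Theorem 1.2.1 viewed as a function of lambda
    num = 1.0 - math.exp(-(lam / n) * emp_risk - kl / n)
    return num / (1.0 - math.exp(-lam / n))

def approx_lambda(emp_risk, kl, n):
    # Closed-form choice: sqrt(2 N K / (r (1 - r)))
    return math.sqrt(2.0 * n * kl / (emp_risk * (1.0 - emp_risk)))

def grid_minimum(emp_risk, kl, n, steps=20000):
    # Brute-force search over lambda in (0, 5N]; returns (best value, best lambda)
    lams = [5.0 * n * (i + 1) / steps for i in range(steps)]
    return min((phi_bound(lam, emp_risk, kl, n), lam) for lam in lams)

best_value, best_lam = grid_minimum(0.2, 10.0, 1000)
lam0 = approx_lambda(0.2, 10.0, 1000)
print(best_lam, best_value, lam0, phi_bound(lam0, 0.2, 10.0, 1000))
```

For these illustrative inputs the closed-form choice lands near the grid minimizer, and the bound it yields differs from the grid minimum only in the fourth decimal place.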

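Let us remark that in the third chapter of this monograph, we will see another way of bounding

$$\inf_{\lambda\in\mathbb{R}_+}\Phi_{\lambda/N}^{-1}\Bigl(q+\frac{d}{\lambda}\Bigr),$$

leading to

Theorem 1.2.3. For any prior distribution $\pi\in\mathcal{M}_+^1(\Theta)$, for any posterior distribution $\rho:\Omega\to\mathcal{M}_+^1(\Theta)$,

$$\mathbb{P}[\rho(R)]\le\Bigl(1+\frac{2\,\mathbb{P}[\mathcal{K}(\rho,\pi)]}{N}\Bigr)^{-1}\Biggl\{\mathbb{P}[\rho(r)]+\frac{\mathbb{P}[\mathcal{K}(\rho,\pi)]}{N}+\sqrt{\frac{2\,\mathbb{P}[\mathcal{K}(\rho,\pi)]\,\mathbb{P}[\rho(r)]\{1-\mathbb{P}[\rho(r)]\}}{N}+\frac{\mathbb{P}[\mathcal{K}(\rho,\pi)]^2}{N^2}}\Biggr\},$$

as soon as $\mathbb{P}[\rho(r)]+\sqrt{\frac{\mathbb{P}[\mathcal{K}(\rho,\pi)]}{2N}}\le\frac{1}{2}$,

and $\mathbb{P}[\rho(R)]\le\mathbb{P}[\rho(r)]+\sqrt{\frac{\mathbb{P}[\mathcal{K}(\rho,\pi)]}{2N}}$ otherwise.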



This theorem sheds light on the influence of three terms on the average expected risk:

• the average empirical risk, $\mathbb{P}[\rho(r)]$, which as a rule will decrease as the size of the classification model increases, acts as a bias term, grasping the ability of the model to account for the observed sample itself;

• a variance term $\frac{1}{N}\mathbb{P}[\rho(r)]\bigl\{1-\mathbb{P}[\rho(r)]\bigr\}$ is due to the random fluctuations of $\rho(r)$;

• a complexity term $\mathbb{P}[\mathcal{K}(\rho,\pi)]$, which as a rule will increase with the size of the classification model, eventually acts as a multiplier of the variance term.



We observed numerically that the bound provided by Theorem 1.2.2 is better than the more classical Vapnik-like bound of Theorem 1.2.3. For instance, when $N=1000$, $\mathbb{P}[\rho(r)]=0.2$ and $\mathbb{P}[\mathcal{K}(\rho,\pi)]=10$, Theorem 1.2.2 gives a bound lower than 0.2604, whereas the more classical Vapnik-like approximation of Theorem 1.2.3 gives a bound larger than 0.2622. Numerical simulations tend to suggest the two bounds are always ordered in the same way, although this could be a little tedious to prove mathematically.
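The numerical comparison quoted above can be reproduced with a few lines. The sketch below encodes the closed forms of Theorems 1.2.2 and 1.2.3 as they read in this section; the function names are hypothetical, and `emp_risk` and `kl` denote $\mathbb{P}[\rho(r)]$ and $\mathbb{P}[\mathcal{K}(\rho,\pi)]$.

```python
import math

def bound_thm_1_2_2(emp_risk, kl, n):
    # lambda / N for the approximately optimal lambda
    root = math.sqrt(2.0 * kl / (n * emp_risk * (1.0 - emp_risk)))
    num = 1.0 - math.exp(-root * emp_risk - kl / n)
    return num / (1.0 - math.exp(-root))

def bound_thm_1_2_3(emp_risk, kl, n):
    # Vapnik-like bound, with its fallback when the side condition fails
    if emp_risk + math.sqrt(kl / (2.0 * n)) > 0.5:
        return emp_risk + math.sqrt(kl / (2.0 * n))
    d = kl / n
    inner = 2.0 * d * emp_risk * (1.0 - emp_risk) + d * d
    return (emp_risk + d + math.sqrt(inner)) / (1.0 + 2.0 * d)

# Values used in the text: N = 1000, P[rho(r)] = 0.2, P[K(rho, pi)] = 10
print(bound_thm_1_2_2(0.2, 10.0, 1000))
print(bound_thm_1_2_3(0.2, 10.0, 1000))
```

Trying other triples of inputs is an easy way to probe the conjectured ordering of the two bounds, although such experiments are of course no substitute for a proof.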
1.2.3. Non random bounds



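It is time now to come to less tentative results and see how far the average expected error rate $\mathbb{P}[\rho(R)]$ is from its best possible value $\inf_\Theta R$.
Let us notice first that

$$\lambda\rho(r)+\mathcal{K}(\rho,\pi)=\mathcal{K}\bigl(\rho,\pi_{\exp(-\lambda r)}\bigr)-\log\bigl\{\pi[\exp(-\lambda r)]\bigr\}.$$

Let us remark moreover that $r\mapsto\log\bigl\{\pi[\exp(-\lambda r)]\bigr\}$ is a convex functional, a property which from a technical point of view can be dealt with in the following way:

$$\begin{aligned}\mathbb{P}\Bigl\{\log\bigl\{\pi[\exp(-\lambda r)]\bigr\}\Bigr\}&=\mathbb{P}\Bigl\{\sup_{\rho\in\mathcal{M}_+^1(\Theta)}-\lambda\rho(r)-\mathcal{K}(\rho,\pi)\Bigr\}\\&\ge\sup_{\rho\in\mathcal{M}_+^1(\Theta)}\mathbb{P}\bigl\{-\lambda\rho(r)-\mathcal{K}(\rho,\pi)\bigr\}=\sup_{\rho\in\mathcal{M}_+^1(\Theta)}-\lambda\rho(R)-\mathcal{K}(\rho,\pi)\\&=\log\bigl\{\pi[\exp(-\lambda R)]\bigr\}=-\int_0^\lambda\pi_{\exp(-\beta R)}(R)\,d\beta.\end{aligned}\tag{1.4}$$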



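These remarks applied to Theorem 1.2.1 lead to

Theorem 1.2.4. For any posterior distribution $\rho:\Omega\to\mathcal{M}_+^1(\Theta)$, for any positive parameter $\lambda$,

$$\mathbb{P}[\rho(R)]\le\frac{1-\exp\Bigl\{-\frac{1}{N}\int_0^\lambda\pi_{\exp(-\beta R)}(R)\,d\beta-\frac{1}{N}\mathbb{P}\bigl[\mathcal{K}(\rho,\pi_{\exp(-\lambda r)})\bigr]\Bigr\}}{1-\exp\bigl(-\frac{\lambda}{N}\bigr)}\le\frac{1}{N\bigl[1-\exp\bigl(-\frac{\lambda}{N}\bigr)\bigr]}\Bigl\{\int_0^\lambda\pi_{\exp(-\beta R)}(R)\,d\beta+\mathbb{P}\bigl[\mathcal{K}(\rho,\pi_{\exp(-\lambda r)})\bigr]\Bigr\}.$$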

This theorem is particularly well suited to the case of the Gibbs posterior distribution $\rho=\pi_{\exp(-\lambda r)}$, where the entropy factor cancels and where $\mathbb{P}[\pi_{\exp(-\lambda r)}(R)]$ is shown to get close to $\inf_\Theta R$ when $N$ goes to $+\infty$, as soon as $\lambda/N$ goes to $0$ while $\lambda$ goes to $+\infty$.
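On a finite parameter set the quantities entering Theorem 1.2.4 can be computed exactly, which makes the identity $-\log\{\pi[\exp(-\lambda R)]\}=\int_0^\lambda\pi_{\exp(-\beta R)}(R)\,d\beta$ and the convergence towards $\inf_\Theta R$ easy to check. The following sketch uses a hypothetical four-point parameter set with made-up risks; the trapezoidal rule and all names are illustrative assumptions, not part of the original argument.

```python
import math

def gibbs_mean(prior, risk, beta):
    # Mean of R under pi_{exp(-beta R)} on a finite parameter set
    weights = [p * math.exp(-beta * r) for p, r in zip(prior, risk)]
    return sum(w * r for w, r in zip(weights, risk)) / sum(weights)

def neg_log_laplace(prior, risk, lam):
    # -log pi[exp(-lam R)]
    return -math.log(sum(p * math.exp(-lam * r) for p, r in zip(prior, risk)))

def integrated_gibbs_mean(prior, risk, lam, steps=2000):
    # Trapezoidal approximation of the integral of beta -> gibbs_mean over [0, lam]
    h = lam / steps
    values = [gibbs_mean(prior, risk, i * h) for i in range(steps + 1)]
    return h * (sum(values) - 0.5 * (values[0] + values[-1]))

def gibbs_bound(prior, risk, lam, n):
    # Second bound of Theorem 1.2.4 when rho is the Gibbs posterior (entropy term = 0)
    return neg_log_laplace(prior, risk, lam) / (n * (1.0 - math.exp(-lam / n)))

prior = [0.25, 0.25, 0.25, 0.25]   # hypothetical flat prior
risk = [0.10, 0.20, 0.30, 0.50]    # hypothetical expected risks R(theta)
print(integrated_gibbs_mean(prior, risk, 50.0), neg_log_laplace(prior, risk, 50.0))
print(gibbs_bound(prior, risk, 1000.0, 10**6))
```

With $\lambda$ large and $\lambda/N$ small, the last printed value approaches the smallest risk in the list, in line with the remark above.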

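We can elaborate on Theorem 1.2.4 and define a notion of dimension of $(\Theta,R)$, with margin $\eta\ge 0$, putting

$$d_\eta(\Theta,R)=\sup_{\beta\in\mathbb{R}_+}\beta\bigl[\pi_{\exp(-\beta R)}(R)-\operatorname{ess\,inf}_\pi R-\eta\bigr]\le-\log\bigl\{\pi\bigl[R\le\operatorname{ess\,inf}_\pi R+\eta\bigr]\bigr\}.\tag{1.5}$$

This last inequality can be established by the chain of inequalities:

$$\beta\,\pi_{\exp(-\beta R)}(R)\le\int_0^\beta\pi_{\exp(-\gamma R)}(R)\,d\gamma=-\log\bigl\{\pi[\exp(-\beta R)]\bigr\}\le\beta\bigl(\operatorname{ess\,inf}_\pi R+\eta\bigr)-\log\bigl\{\pi\bigl(R\le\operatorname{ess\,inf}_\pi R+\eta\bigr)\bigr\},$$

where we have used successively the fact that $\lambda\mapsto\pi_{\exp(-\lambda R)}(R)$ is decreasing (because it is the derivative of the concave function $\lambda\mapsto-\log\{\pi[\exp(-\lambda R)]\}$) and the fact that the exponential function takes positive values.

In typical "parametric" situations $d_0(\Theta,R)$ will be finite, and in all circumstances $d_\eta(\Theta,R)$ will be finite for any $\eta>0$ (this is a direct consequence of the definition of the essential infimum). Using this notion of dimension, we see that

$$\begin{aligned}\int_0^\lambda\pi_{\exp(-\beta R)}(R)\,d\beta&\le\lambda\bigl(\operatorname{ess\,inf}_\pi R+\eta\bigr)+\int_0^\lambda\Bigl[\frac{d_\eta}{\beta}\wedge\bigl(1-\operatorname{ess\,inf}_\pi R-\eta\bigr)\Bigr]d\beta\\&=\lambda\bigl(\operatorname{ess\,inf}_\pi R+\eta\bigr)+d_\eta(\Theta,R)\log\Bigl[\frac{e\lambda}{d_\eta(\Theta,R)}\bigl(1-\operatorname{ess\,inf}_\pi R-\eta\bigr)\Bigr].\end{aligned}$$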

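This leads to

Corollary 1.2.5. With the above notation, for any margin $\eta\in\mathbb{R}_+$, for any posterior distribution $\rho:\Omega\to\mathcal{M}_+^1(\Theta)$,

$$\mathbb{P}[\rho(R)]\le\inf_{\lambda\in\mathbb{R}_+}\Phi^{-1}_{\lambda/N}\Bigl[\operatorname{ess\,inf}_\pi R+\eta+\frac{d_\eta}{\lambda}\log\Bigl(\frac{e\lambda}{d_\eta}\Bigr)+\frac{\mathbb{P}\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\lambda r)}]\bigr\}}{\lambda}\Bigr].$$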

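If one wants a posterior distribution with a small support, the theorem can also be applied to the case when $\rho$ is obtained by truncating $\pi_{\exp(-\lambda r)}$ to some level set to reduce its support: let $\Theta_p=\{\theta\in\Theta: r(\theta)\le p\}$, and let us define for any $q\in(0,1]$ the level $p_q=\inf\{p:\pi_{\exp(-\lambda r)}(\Theta_p)\ge q\}$, let us then define $\rho_q$ by its density

$$\frac{d\rho_q}{d\pi_{\exp(-\lambda r)}}(\theta)=\frac{\mathbb{1}(\theta\in\Theta_{p_q})}{\pi_{\exp(-\lambda r)}(\Theta_{p_q})},$$

then $\rho_0=\pi_{\exp(-\lambda r)}$ and for any $q\in[0,1)$,

$$\mathbb{P}[\rho_q(R)]\le\frac{1-\exp\Bigl\{-\frac{1}{N}\int_0^\lambda\pi_{\exp(-\beta R)}(R)\,d\beta+\frac{\log(q)}{N}\Bigr\}}{1-\exp\bigl(-\frac{\lambda}{N}\bigr)}\le\frac{1}{N\bigl[1-\exp\bigl(-\frac{\lambda}{N}\bigr)\bigr]}\Bigl\{\int_0^\lambda\pi_{\exp(-\beta R)}(R)\,d\beta-\log(q)\Bigr\}.$$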



1.2.4. Deviation bounds

They provide results holding under the distribution $\mathbb{P}$ of the sample with probability at least $1-\epsilon$, for any given confidence level, set by the choice of $\epsilon\in(0,1)$. Using them is the only way to be quite (i.e. with probability $1-\epsilon$) sure to do the right thing, although this right thing may be over-pessimistic, since deviation upper bounds are larger than corresponding unbiased bounds.

Starting again from Theorem 1.1.4, and using Markov's inequality $\mathbb{P}[\exp(h)\ge 1]\le\mathbb{P}[\exp(h)]$, we obtain

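Theorem 1.2.6. For any positive parameter $\lambda$, with $\mathbb{P}$ probability at least $1-\epsilon$, for any posterior distribution $\rho:\Omega\to\mathcal{M}_+^1(\Theta)$,

$$\rho(R)\le\Phi^{-1}_{\lambda/N}\Bigl\{\rho(r)+\frac{\mathcal{K}(\rho,\pi)-\log(\epsilon)}{\lambda}\Bigr\}=\frac{1-\exp\Bigl\{-\frac{\lambda\rho(r)}{N}-\frac{\mathcal{K}(\rho,\pi)-\log(\epsilon)}{N}\Bigr\}}{1-\exp\bigl(-\frac{\lambda}{N}\bigr)}\le\frac{\lambda}{N\bigl[1-\exp\bigl(-\frac{\lambda}{N}\bigr)\bigr]}\Bigl[\rho(r)+\frac{\mathcal{K}(\rho,\pi)-\log(\epsilon)}{\lambda}\Bigr].$$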
We see that for a fixed value of the parameter $\lambda$, the upper bound is optimized when the posterior is chosen to be the Gibbs distribution $\rho=\pi_{\exp(-\lambda r)}$.
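The optimality of the Gibbs posterior for fixed $\lambda$ can be observed directly on a finite parameter set, where the divergence and the bound of Theorem 1.2.6 are computable. The sketch below is only an illustration under that finiteness assumption; the prior, the empirical risks and the function names are all hypothetical.

```python
import math

def kl(rho, prior):
    # Kullback divergence K(rho, prior) between two finite distributions
    return sum(q * math.log(q / p) for q, p in zip(rho, prior) if q > 0)

def gibbs(prior, emp_risk, lam):
    # Gibbs posterior pi_{exp(-lam r)}
    weights = [p * math.exp(-lam * r) for p, r in zip(prior, emp_risk)]
    total = sum(weights)
    return [w / total for w in weights]

def deviation_bound(rho, prior, emp_risk, lam, n, eps):
    # First (tighter) form of the bound of Theorem 1.2.6
    mean_r = sum(q * r for q, r in zip(rho, emp_risk))
    d = kl(rho, prior) - math.log(eps)
    return (1.0 - math.exp(-lam * mean_r / n - d / n)) / (1.0 - math.exp(-lam / n))

prior = [0.25, 0.25, 0.25, 0.25]      # hypothetical flat prior
emp_risk = [0.10, 0.30, 0.20, 0.40]   # hypothetical empirical risks r(theta)
lam, n, eps = 50.0, 200, 0.05
for rho in (gibbs(prior, emp_risk, lam), prior, [1.0, 0.0, 0.0, 0.0]):
    print(deviation_bound(rho, prior, emp_risk, lam, n, eps))
```

The three printed values correspond to the Gibbs posterior, the prior itself used as a posterior, and a Dirac mass at the empirical risk minimizer; the first is the smallest.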

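In this theorem, we have bounded $\rho(R)$, the average expected risk of an estimator $\widehat{\theta}$ drawn from the posterior $\rho$. This is what we will do most of the time in this study. This is the error rate we will get if we classify a large number of test patterns, drawing a new $\widehat{\theta}$ for each one. However, we can also be interested in the error rate we get if we draw only one $\widehat{\theta}$ from $\rho$ and use this single draw of $\widehat{\theta}$ to classify a large number of test patterns. This error rate is $R(\widehat{\theta})$. To state a result about its deviations, we can start back from Lemma 1.1.1 and integrate it with respect to the prior distribution $\pi$ to get for any real constant $\lambda$

$$\mathbb{P}\Bigl\{\pi\Bigl\{\exp\Bigl[\lambda\bigl[\Phi_{\lambda/N}(R)-r\bigr]\Bigr]\Bigr\}\Bigr\}\le 1.$$

For any posterior distribution $\rho:\Omega\to\mathcal{M}_+^1(\Theta)$, this can be rewritten as

$$\mathbb{P}\Bigl\{\rho\Bigl\{\exp\Bigl[\lambda\bigl[\Phi_{\lambda/N}(R)-r\bigr]-\log\Bigl(\frac{d\rho}{d\pi}\Bigr)+\log(\epsilon)\Bigr]\Bigr\}\Bigr\}\le\epsilon,$$

proving

Theorem 1.2.7. For any positive real parameter $\lambda$, for any posterior distribution $\rho:\Omega\to\mathcal{M}_+^1(\Theta)$, with $\mathbb{P}\rho$ probability at least $1-\epsilon$,

$$R(\widehat{\theta})\le\Phi^{-1}_{\lambda/N}\Bigl\{r(\widehat{\theta})+\lambda^{-1}\log\Bigl(\epsilon^{-1}\frac{d\rho}{d\pi}\Bigr)\Bigr\}\le\frac{\lambda}{N\bigl[1-\exp\bigl(-\frac{\lambda}{N}\bigr)\bigr]}\Bigl\{r(\widehat{\theta})+\lambda^{-1}\log\Bigl(\epsilon^{-1}\frac{d\rho}{d\pi}\Bigr)\Bigr\}.$$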

Let us remark that the bound provided here is the exact counterpart of the bound of Theorem 1.2.6, since $\log\bigl(\frac{d\rho}{d\pi}\bigr)$ appears as a disintegrated version of the divergence $\mathcal{K}(\rho,\pi)$. The parallel between the two theorems is particularly striking in the special case when $\rho=\pi_{\exp(-\lambda r)}$. Indeed Theorem 1.2.6 proves that with $\mathbb{P}$ probability at least $1-\epsilon$,

$$\pi_{\exp(-\lambda r)}(R)\le\Phi^{-1}_{\lambda/N}\Bigl\{-\frac{\log\bigl\{\pi[\exp(-\lambda r)]\bigr\}+\log(\epsilon)}{\lambda}\Bigr\},$$

whereas Theorem 1.2.7 proves that with $\mathbb{P}\pi_{\exp(-\lambda r)}$ probability at least $1-\epsilon$

$$R(\widehat{\theta})\le\Phi^{-1}_{\lambda/N}\Bigl\{-\frac{\log\bigl\{\pi[\exp(-\lambda r)]\bigr\}+\log(\epsilon)}{\lambda}\Bigr\},$$

showing that we get the same deviation bound for $\pi_{\exp(-\lambda r)}(R)$ under $\mathbb{P}$ and for $\widehat{\theta}$ under $\mathbb{P}\pi_{\exp(-\lambda r)}$.

We would like to show now how to optimize with respect to $\lambda$ the bound given by Theorem 1.2.6 (the same discussion would apply to Theorem 1.2.7). Let us notice first that values of $\lambda$ less than $1$ are not interesting (because they provide a bound larger than one, at least as soon as $\epsilon\le\exp(-1)$). Let us consider some real parameter $\alpha>1$, and the set $\Lambda=\{\alpha^k; k\in\mathbb{N}\}$, on which we put the probability measure $\nu(\alpha^k)=[(k+1)(k+2)]^{-1}$. Applying Theorem 1.2.6 to $\lambda=\alpha^k$ at confidence level $1-\frac{\epsilon}{(k+1)(k+2)}$, and using a union bound, we see that with probability at least $1-\epsilon$, for any posterior distribution $\rho$,

$$\rho(R)\le\inf_{\lambda'\in\Lambda}\Phi^{-1}_{\lambda'/N}\Biggl\{\rho(r)+\frac{\mathcal{K}(\rho,\pi)-\log(\epsilon)+2\log\Bigl(\frac{\log(\alpha^2\lambda')}{\log(\alpha)}\Bigr)}{\lambda'}\Biggr\}.$$

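Now we can remark that for any $\lambda\in[1,+\infty)$, there is $\lambda'\in\Lambda$ such that $\alpha^{-1}\lambda\le\lambda'\le\lambda$. Moreover, for any $q\in(0,1)$, $\beta\mapsto\Phi_\beta^{-1}(q)$ is increasing on $\mathbb{R}_+$. Thus with probability at least $1-\epsilon$, for any posterior distribution $\rho$,

$$\begin{aligned}\rho(R)&\le\inf_{\lambda\in[1,\infty)}\Phi^{-1}_{\lambda/N}\Bigl\{\rho(r)+\frac{\alpha}{\lambda}\Bigl[\mathcal{K}(\rho,\pi)-\log(\epsilon)+2\log\Bigl(\frac{\log(\alpha^2\lambda)}{\log(\alpha)}\Bigr)\Bigr]\Bigr\}\\&=\inf_{\lambda\in[1,\infty)}\frac{1-\exp\Bigl\{-\frac{\lambda}{N}\rho(r)-\frac{\alpha}{N}\Bigl[\mathcal{K}(\rho,\pi)-\log(\epsilon)+2\log\Bigl(\frac{\log(\alpha^2\lambda)}{\log(\alpha)}\Bigr)\Bigr]\Bigr\}}{1-\exp\bigl(-\frac{\lambda}{N}\bigr)}.\end{aligned}$$

Taking the approximately optimal value

$$\lambda=\sqrt{\frac{2N\alpha\,[\mathcal{K}(\rho,\pi)-\log(\epsilon)]}{\rho(r)[1-\rho(r)]}},$$

we obtain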

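Theorem 1.2.8. With probability $1-\epsilon$, for any posterior distribution $\rho:\Omega\to\mathcal{M}_+^1(\Theta)$, putting $d(\rho,\epsilon)=\mathcal{K}(\rho,\pi)-\log(\epsilon)$,

$$\begin{aligned}\rho(R)&\le\inf_{k\in\mathbb{N}}\frac{1-\exp\Bigl\{-\frac{\alpha^k}{N}\rho(r)-\frac{1}{N}\Bigl[d(\rho,\epsilon)+\log[(k+1)(k+2)]\Bigr]\Bigr\}}{1-\exp\bigl(-\frac{\alpha^k}{N}\bigr)}\\&\le\frac{1-\exp\Biggl\{-\sqrt{\frac{2\alpha\rho(r)d(\rho,\epsilon)}{N[1-\rho(r)]}}-\frac{\alpha}{N}\Biggl[d(\rho,\epsilon)+2\log\Biggl(\frac{\log\Bigl(\alpha^2\sqrt{\frac{2N\alpha d(\rho,\epsilon)}{\rho(r)[1-\rho(r)]}}\Bigr)}{\log(\alpha)}\Biggr)\Biggr]\Biggr\}}{1-\exp\Biggl[-\sqrt{\frac{2\alpha d(\rho,\epsilon)}{N\rho(r)[1-\rho(r)]}}\Biggr]}.\end{aligned}$$

Moreover with probability at least $1-\epsilon$, for any posterior distribution $\rho$ such that $\rho(r)=0$,

$$\rho(R)\le 1-\exp\Bigl[-\frac{\mathcal{K}(\rho,\pi)-\log(\epsilon)}{N}\Bigr].$$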
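The first bound of Theorem 1.2.8 is an infimum over the geometric grid $\lambda=\alpha^k$ and is straightforward to evaluate once $\rho(r)$ and $\mathcal{K}(\rho,\pi)$ are known. The sketch below is a direct transcription under the assumption that these two numbers are given; the truncation of the grid at `k_max`, the default $\alpha=1.1$ and the function names are illustrative choices.

```python
import math

def grid_weight_sum(k_max):
    # Partial sum of the weights nu(alpha^k) = 1 / ((k + 1)(k + 2)), k = 0..k_max
    return sum(1.0 / ((k + 1) * (k + 2)) for k in range(k_max + 1))

def grid_bound(emp_risk, kl_div, n, eps, alpha=1.1, k_max=200):
    # inf over k of the bound of Theorem 1.2.6 at lambda = alpha^k,
    # with confidence parameter eps / ((k + 1)(k + 2)) (union bound)
    d = kl_div - math.log(eps)
    best = float("inf")
    for k in range(k_max + 1):
        lam = alpha ** k
        penalty = d + math.log((k + 1) * (k + 2))
        num = 1.0 - math.exp(-lam * emp_risk / n - penalty / n)
        best = min(best, num / (1.0 - math.exp(-lam / n)))
    return best

# Illustrative inputs: rho(r) = 0.2, K(rho, pi) = 10, N = 1000, eps = 0.05
print(grid_bound(0.2, 10.0, 1000, 0.05))
```

Because the weights $[(k+1)(k+2)]^{-1}$ sum to one over $k\in\mathbb{N}$, the total confidence spent by the union bound is exactly $\epsilon$; the price paid for uniformity in $\lambda$ is the extra term $\log[(k+1)(k+2)]$.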



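We can also elaborate on the results in another direction by introducing the empirical dimension

$$d_e=\sup_{\beta\in\mathbb{R}_+}\beta\bigl[\pi_{\exp(-\beta r)}(r)-\operatorname{ess\,inf}_\pi r\bigr]\le-\log\bigl[\pi\bigl(r=\operatorname{ess\,inf}_\pi r\bigr)\bigr].\tag{1.6}$$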






Chapter 1. Inductive PAC-Bayesian learning 



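There is no need to introduce a margin in this definition, since $r$ takes at most $N+1$ values, and therefore $\pi(r=\operatorname{ess\,inf}_\pi r)$ is strictly positive. This leads to

Corollary 1.2.9. For any positive real constant $\lambda$, with $\mathbb{P}$ probability at least $1-\epsilon$, for any posterior distribution $\rho:\Omega\to\mathcal{M}_+^1(\Theta)$,

$$\rho(R)\le\Phi^{-1}_{\lambda/N}\Bigl[\operatorname{ess\,inf}_\pi r+\frac{d_e}{\lambda}\log\Bigl(\frac{e\lambda}{d_e}\Bigr)+\frac{\mathcal{K}\bigl[\rho,\pi_{\exp(-\lambda r)}\bigr]-\log(\epsilon)}{\lambda}\Bigr].$$

We could then make the bound uniform in $\lambda$ and optimize this parameter in a way similar to what was done to obtain Theorem 1.2.8.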



1.3. Local bounds 

In this section, better bounds will be achieved through a better choice of the prior 
distribution. This better prior distribution turns out to depend on the unknown 
sample distribution P, and some work is required to circumvent this and obtain 
empirical bounds. 



1.3.1. Choice of the prior 

As mentioned in the introduction, if one is willing to minimize the bound in expectation provided by Theorem 1.2.1, one is led to consider the optimal choice $\pi=\mathbb{P}(\rho)$. However, this is only an ideal choice, since $\mathbb{P}$ is in all conceivable situations unknown. Nevertheless it shows that it is possible through Theorem 1.2.1 to measure the complexity of the classification model with $\mathbb{P}\bigl\{\mathcal{K}[\rho,\mathbb{P}(\rho)]\bigr\}$, which is nothing but the mutual information between the random sample $(X_i,Y_i)_{i=1}^N$ and the estimated parameter $\widehat{\theta}$, under the joint distribution $\mathbb{P}\rho$.

In practice, since we cannot choose $\pi=\mathbb{P}(\rho)$, we have to be content with a flat prior $\pi$, resulting in a bound measuring complexity according to $\mathbb{P}[\mathcal{K}(\rho,\pi)]=\mathbb{P}\bigl\{\mathcal{K}[\rho,\mathbb{P}(\rho)]\bigr\}+\mathcal{K}[\mathbb{P}(\rho),\pi]$, larger by the entropy factor $\mathcal{K}[\mathbb{P}(\rho),\pi]$ than the optimal one (we are still commenting on Theorem 1.2.1).

If we want to base the choice of $\pi$ on Theorem 1.2.4, and if we choose $\rho=\pi_{\exp(-\lambda r)}$ to optimize this bound, we will be inclined to choose some $\pi$ such that

$$\frac{1}{\lambda}\int_0^\lambda\pi_{\exp(-\beta R)}(R)\,d\beta=-\frac{1}{\lambda}\log\bigl\{\pi[\exp(-\lambda R)]\bigr\}$$

is as close as possible to $\inf_{\theta\in\Theta}R(\theta)$ in all circumstances. To give a more specific example, in the case when the distribution of the design $(X_i)_{i=1}^N$ is known, one can introduce on the parameter space $\Theta$ the metric $D$ already defined by equation (1.3) (or some available upper bound for this distance). In view of the fact that $R(\theta)-R(\theta')\le D(\theta,\theta')$, for any $\theta,\theta'\in\Theta$, it can be meaningful, at least theoretically, to choose $\pi$ as

$$\pi=\sum_{k=1}^\infty\frac{1}{k(k+1)}\,\pi_k,$$

where $\pi_k$ is the uniform measure on some minimal (or close to minimal) $2^{-k}$-net $\mathcal{N}(\Theta,D,2^{-k})$ of the metric space $(\Theta,D)$. With this choice

$$-\frac{1}{\lambda}\log\bigl\{\pi[\exp(-\lambda R)]\bigr\}\le\inf_{\theta\in\Theta}R(\theta)+\inf_k\Bigl\{2^{-k}+\frac{\log\bigl(|\mathcal{N}(\Theta,D,2^{-k})|\bigr)+\log[k(k+1)]}{\lambda}\Bigr\}.$$

Another possibility, when we have to deal with real valued parameters, meaning that $\Theta\subset\mathbb{R}^d$, is to code each real component $\theta_i\in\mathbb{R}$ of $\theta=(\theta_i)_{i=1}^d$ to some precision and to use a prior $\mu$ which is atomic on dyadic numbers. More precisely let us parametrize the set of dyadic real numbers as

$$\mathcal{D}=\Bigl\{r\bigl[s,m,p,(b_j)_{j=1}^p\bigr]=s\,2^m\Bigl(1+\sum_{j=1}^p b_j2^{-j}\Bigr):\ s\in\{-1,+1\},\ m\in\mathbb{Z},\ p\in\mathbb{N},\ b_j\in\{0,1\}\Bigr\},$$

where, as can be seen, $s$ codes the sign, $m$ the order of magnitude, $p$ the precision and $(b_j)_{j=1}^p$ the binary representation of the dyadic number $r\bigl[s,m,p,(b_j)_{j=1}^p\bigr]$. We can for instance consider on $\mathcal{D}$ the probability distribution

$$\mu\bigl\{r\bigl[s,m,p,(b_j)_{j=1}^p\bigr]\bigr\}=\bigl[3(|m|+1)(|m|+2)(p+1)(p+2)2^p\bigr]^{-1},\tag{1.7}$$

and define $\pi\in\mathcal{M}_+^1(\mathbb{R}^d)$ as $\pi=\mu^{\otimes d}$. This kind of "coding" prior distribution can be used also to define a prior on the integers (by renormalizing the restriction of $\mu$ to integers to get a probability distribution). Using $\mu$ is somehow equivalent to picking up a representative of each dyadic interval, and makes it possible to restrict to the case when the posterior $\rho$ is a Dirac mass without losing too much (when $\Theta=(0,1)$, this approach is somewhat equivalent to considering as prior distribution the Lebesgue measure and using as posterior distributions the uniform probability measures on dyadic intervals, with the advantage of obtaining non-randomized estimators). When one uses in this way an atomic prior and Dirac masses as posterior distributions, the bounds proven so far can be obtained through a simpler union bound argument. This is so true that some of the detractors of the PAC-Bayesian approach (which, as a newcomer, has sometimes received a suspicious greeting among statisticians) have argued that it cannot bring anything that elementary union bound arguments could not essentially provide. We do not of course share this derogatory opinion, and while we think that allowing for non atomic priors and posteriors is worthwhile, we also would like to stress that the upcoming local and relative bounds could hardly be obtained with the only help of union bounds.
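That the weights of equation (1.7) add up to one over all codes $(s,m,p,(b_j))$ can be checked with exact rational arithmetic on truncated ranges of $m$ and $p$. The sketch below assumes the weight $[3(|m|+1)(|m|+2)(p+1)(p+2)2^p]^{-1}$ per code, with two signs and $2^p$ bit strings for each pair $(m,p)$; the function names and truncation levels are illustrative.

```python
from fractions import Fraction

def mu_weight(m, p):
    # Mass given by (1.7) to a single code with magnitude m and precision p
    return Fraction(1, 3 * (abs(m) + 1) * (abs(m) + 2) * (p + 1) * (p + 2) * 2 ** p)

def truncated_mass(m_max, p_max):
    # Total mass of all codes with |m| <= m_max and p <= p_max
    total = Fraction(0)
    for m in range(-m_max, m_max + 1):
        for p in range(p_max + 1):
            total += 2 * (2 ** p) * mu_weight(m, p)   # 2 signs, 2^p bit strings
    return total

def closed_form(m_max, p_max):
    # Telescoping sums give (1 - 4 / (3 (M + 2))) * (1 - 1 / (P + 2))
    return (1 - Fraction(4, 3 * (m_max + 2))) * (1 - Fraction(1, p_max + 2))

print(float(truncated_mass(50, 30)), float(closed_form(50, 30)))
```

Both factors of the closed form tend to one as the truncation levels grow, which is the normalization claimed for $\mu$. The code length $-\log_2\mu$ of a parameter grows only logarithmically in $|m|$ and linearly in the precision $p$, which is what makes this prior usable as a complexity penalty.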

Although the choice of a flat prior seems at first glance to be the only alternative when nothing is known about the sample distribution $\mathbb{P}$, the previous discussion shows that this type of choice is lacking proper localisation, and namely that we lose a factor $\mathcal{K}\bigl\{\mathbb{P}[\pi_{\exp(-\lambda r)}],\pi\bigr\}$, the divergence between the bound-optimal prior $\mathbb{P}[\pi_{\exp(-\lambda r)}]$, which is concentrated near the minima of $R$ in favourable situations, and the flat prior $\pi$. Fortunately, there are technical ways to get around this difficulty and to obtain more local empirical bounds.

1.3.2. Unbiased local empirical bounds 

The idea is to start with some flat prior $\pi\in\mathcal{M}_+^1(\Theta)$, and the posterior distribution $\rho=\pi_{\exp(-\lambda r)}$ minimizing the bound of Theorem 1.2.1, when $\pi$ is used as a prior. To improve the bound, we would like to use $\mathbb{P}[\pi_{\exp(-\lambda r)}]$ instead of $\pi$, and we are going to make the guess that we could approximate it with $\pi_{\exp(-\beta R)}$ (we have replaced the parameter $\lambda$ with some distinct parameter $\beta$ to give some more freedom to our investigation, and also because, intuitively, $\mathbb{P}[\pi_{\exp(-\lambda r)}]$ may be expected to be less concentrated than each of the $\pi_{\exp(-\lambda r)}$ it is mixing, which suggests that the best approximation of $\mathbb{P}[\pi_{\exp(-\lambda r)}]$ by some $\pi_{\exp(-\beta R)}$ may be obtained for some parameter $\beta<\lambda$). We are then led to look for some empirical upper bound of $\mathcal{K}[\rho,\pi_{\exp(-\beta R)}]$. This is happily provided by the following computation

$$\begin{aligned}\mathbb{P}\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\beta R)}]\bigr\}&=\mathbb{P}[\mathcal{K}(\rho,\pi)]+\beta\,\mathbb{P}[\rho(R)]+\log\bigl\{\pi[\exp(-\beta R)]\bigr\}\\&=\mathbb{P}\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\beta r)}]\bigr\}+\beta\,\mathbb{P}[\rho(R-r)]+\log\bigl\{\pi[\exp(-\beta R)]\bigr\}-\mathbb{P}\bigl\{\log\pi[\exp(-\beta r)]\bigr\}.\end{aligned}$$

Using the convexity of $r\mapsto\log\{\pi[\exp(-\beta r)]\}$ as in equation (1.4) we conclude that

$$0\le\mathbb{P}\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\beta R)}]\bigr\}\le\beta\,\mathbb{P}[\rho(R-r)]+\mathbb{P}\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\beta r)}]\bigr\}.$$

This inequality has an interest of its own, since it provides a lower bound for $\mathbb{P}[\rho(R)]$. Moreover we can plug it into Theorem 1.2.1 applied to the prior distribution $\pi_{\exp(-\beta R)}$ and obtain for any posterior distribution $\rho$ and any positive parameter $\lambda$ that

$$\Phi_{\lambda/N}\bigl\{\mathbb{P}[\rho(R)]\bigr\}\le\mathbb{P}\Bigl\{\rho(r)+\frac{\beta}{\lambda}\rho(R-r)+\frac{1}{\lambda}\mathcal{K}\bigl[\rho,\pi_{\exp(-\beta r)}\bigr]\Bigr\}.$$

In view of this, it is convenient to introduce the function

$$\widetilde{\Phi}_{a,b}(p)=(1-b)^{-1}\bigl[\Phi_a(p)-bp\bigr]=-(1-b)^{-1}\Bigl\{a^{-1}\log\bigl\{1-p[1-\exp(-a)]\bigr\}+bp\Bigr\},\qquad p\in(0,1),\ a\in(0,\infty),\ b\in[0,1).$$

This is a convex function of $p$, moreover

$$\widetilde{\Phi}'_{a,b}(0)=\bigl\{a^{-1}[1-\exp(-a)]-b\bigr\}(1-b)^{-1},$$

showing that it is an increasing one to one convex map of the unit interval onto itself as soon as $b\le a^{-1}[1-\exp(-a)]$. Its convexity, combined with the value of its derivative at the origin, shows that

$$\widetilde{\Phi}_{a,b}(p)\ge\frac{a^{-1}[1-\exp(-a)]-b}{1-b}\,p.$$

Using this notation and remarks, we can state

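Theorem 1.3.1. For any positive real constants $\beta$ and $\lambda$ such that $0<\beta<N\bigl[1-\exp\bigl(-\frac{\lambda}{N}\bigr)\bigr]$, for any posterior distribution $\rho:\Omega\to\mathcal{M}_+^1(\Theta)$,

$$\begin{aligned}\mathbb{P}\Bigl\{\rho(r)-\frac{\mathcal{K}[\rho,\pi_{\exp(-\beta r)}]}{\beta}\Bigr\}\le\mathbb{P}[\rho(R)]&\le\widetilde{\Phi}^{-1}_{\frac{\lambda}{N},\frac{\beta}{\lambda}}\Bigl\{\mathbb{P}\Bigl[\rho(r)+\frac{\mathcal{K}[\rho,\pi_{\exp(-\beta r)}]}{\lambda-\beta}\Bigr]\Bigr\}\\&\le\frac{\lambda-\beta}{N\bigl[1-\exp\bigl(-\frac{\lambda}{N}\bigr)\bigr]-\beta}\,\mathbb{P}\Bigl\{\rho(r)+\frac{\mathcal{K}[\rho,\pi_{\exp(-\beta r)}]}{\lambda-\beta}\Bigr\}.\end{aligned}$$

Thus (taking $\lambda=2\beta$), for any $\beta$ such that $0<\beta<\frac{N}{2}$,

$$\mathbb{P}[\rho(R)]\le\frac{1}{1-\frac{2\beta}{N}}\,\mathbb{P}\Bigl\{\rho(r)+\frac{\mathcal{K}[\rho,\pi_{\exp(-\beta r)}]}{\beta}\Bigr\}.$$

Note that the last inequality is obtained using the fact that $1-\exp(-x)\ge x-\frac{x^2}{2}$, $x\in\mathbb{R}_+$.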
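The function $\widetilde{\Phi}_{a,b}$ has no closed-form inverse, but since it is increasing on the unit interval (under the stated condition on $b$) it can be inverted by bisection. The sketch below does so and compares the resulting bound of Theorem 1.3.1 with its two linearized versions; the names, the tolerance and the numerical inputs are illustrative assumptions, and `kl_local` stands for $\mathbb{P}\{\mathcal{K}[\rho,\pi_{\exp(-\beta r)}]\}$, assumed to be given.

```python
import math

def phi_tilde(a, b, p):
    # (1 - b)^{-1} [Phi_a(p) - b p], with Phi_a(p) = -(1/a) log(1 - p [1 - exp(-a)])
    phi = -math.log(1.0 - p * (1.0 - math.exp(-a))) / a
    return (phi - b * p) / (1.0 - b)

def phi_tilde_inv(a, b, q, tol=1e-12):
    # Bisection on [0, 1]; valid when p -> phi_tilde(a, b, p) is increasing
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi_tilde(a, b, mid) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def local_bounds(emp_risk, kl_local, lam, beta, n):
    # Upper bounds of Theorem 1.3.1: exact inverse, then linearized version
    q = emp_risk + kl_local / (lam - beta)
    tight = phi_tilde_inv(lam / n, beta / lam, q)
    linear = (lam - beta) / (n * (1.0 - math.exp(-lam / n)) - beta) * q
    return tight, linear

# Illustrative inputs with lambda = 2 beta
print(local_bounds(emp_risk=0.2, kl_local=5.0, lam=200.0, beta=100.0, n=1000))
```

With $\lambda=2\beta$ the linearized value stays below $(1-2\beta/N)^{-1}$ times the argument, as stated at the end of the theorem.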

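Corollary 1.3.2. For any $\beta\in(0,N)$,

$$\mathbb{P}\bigl[\pi_{\exp(-\beta r)}(r)\bigr]\le\mathbb{P}\bigl[\pi_{\exp(-\beta r)}(R)\bigr]\le\inf_{\lambda\in(-N\log(1-\frac{\beta}{N}),\infty)}\frac{\lambda-\beta}{N\bigl[1-\exp\bigl(-\frac{\lambda}{N}\bigr)\bigr]-\beta}\,\mathbb{P}\bigl[\pi_{\exp(-\beta r)}(r)\bigr]\le\frac{1}{1-\frac{2\beta}{N}}\,\mathbb{P}\bigl[\pi_{\exp(-\beta r)}(r)\bigr],$$

the last inequality holding only when $\beta<\frac{N}{2}$.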

It is interesting to compare the upper bound provided by this corollary with Theorem 1.2.1 when the posterior is a Gibbs measure $\rho=\pi_{\exp(-\beta r)}$. We see that we have got rid of the entropy term $\mathcal{K}[\pi_{\exp(-\beta r)},\pi]$, but at the price of an increase of the multiplicative factor, which for small values of $\frac{\beta}{N}$ grows from $\bigl(1-\frac{\beta}{2N}\bigr)^{-1}$ (when we take $\lambda=\beta$ in Theorem 1.2.1), to $\bigl(1-\frac{2\beta}{N}\bigr)^{-1}$. Therefore non-localized bounds have an interest of their own, and are superseded by localized bounds only in favourable circumstances (presumably when the sample is large enough when compared with the complexity of the classification model).

Corollary 1.3.2 shows that when $\frac{\beta}{N}$ is small, $\pi_{\exp(-\beta r)}(r)$ is a tight approximation of $\pi_{\exp(-\beta r)}(R)$ in the mean (since we have an upper bound and a lower bound which are close together).

Another corollary is obtained by optimizing the bound given by Theorem 1.3.1 in $\rho$, which is done by taking $\rho=\pi_{\exp(-\lambda r)}$.

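Corollary 1.3.3. For any positive real constants $\beta$ and $\lambda$ such that $0<\beta<N\bigl[1-\exp\bigl(-\frac{\lambda}{N}\bigr)\bigr]$,

$$\mathbb{P}\bigl[\pi_{\exp(-\lambda r)}(R)\bigr]\le\widetilde{\Phi}^{-1}_{\frac{\lambda}{N},\frac{\beta}{\lambda}}\Bigl\{\mathbb{P}\Bigl[\frac{1}{\lambda-\beta}\int_\beta^\lambda\pi_{\exp(-\gamma r)}(r)\,d\gamma\Bigr]\Bigr\}\le\frac{1}{N\bigl[1-\exp\bigl(-\frac{\lambda}{N}\bigr)\bigr]-\beta}\,\mathbb{P}\Bigl[\int_\beta^\lambda\pi_{\exp(-\gamma r)}(r)\,d\gamma\Bigr].$$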

Although this inequality gives by construction a better upper bound for $\inf_{\lambda\in\mathbb{R}_+}\mathbb{P}[\pi_{\exp(-\lambda r)}(R)]$ than Corollary 1.3.2, it is not easy to tell which one of the two inequalities is the best to bound $\mathbb{P}[\pi_{\exp(-\lambda r)}(R)]$ for a fixed (and possibly suboptimal) value of $\lambda$, because in this case, one factor is improved while the other is worsened.

Using the empirical dimension $d_e$ defined by equation (1.6) we see that

$$\frac{1}{\lambda-\beta}\int_\beta^\lambda\pi_{\exp(-\gamma r)}(r)\,d\gamma\le\operatorname{ess\,inf}_\pi r+\frac{d_e}{\lambda-\beta}\log\Bigl(\frac{\lambda}{\beta}\Bigr).$$

Therefore, in the case when we keep the ratio $\frac{\lambda}{\beta}$ bounded, we get a better dependence on the empirical dimension $d_e$ than in Corollary 1.2.9.

1.3.3. Non random local bounds 

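Let us come now to the localization of the non-random upper bound given by Theorem 1.2.4. According to Theorem 1.2.1 applied to the localized prior $\pi_{\exp(-\beta R)}$,

$$\begin{aligned}\lambda\Phi_{\lambda/N}\bigl\{\mathbb{P}[\rho(R)]\bigr\}&\le\mathbb{P}\bigl\{\lambda\rho(r)+\mathcal{K}(\rho,\pi)+\beta\rho(R)\bigr\}+\log\bigl\{\pi[\exp(-\beta R)]\bigr\}\\&=\mathbb{P}\Bigl\{\mathcal{K}\bigl[\rho,\pi_{\exp(-\lambda r)}\bigr]-\log\bigl\{\pi[\exp(-\lambda r)]\bigr\}+\beta\rho(R)\Bigr\}+\log\bigl\{\pi[\exp(-\beta R)]\bigr\}\\&\le\mathbb{P}\Bigl\{\mathcal{K}\bigl[\rho,\pi_{\exp(-\lambda r)}\bigr]+\beta\rho(R)\Bigr\}-\log\bigl\{\pi[\exp(-\lambda R)]\bigr\}+\log\bigl\{\pi[\exp(-\beta R)]\bigr\},\end{aligned}$$

where we have used as previously inequality (1.4). This proves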

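Theorem 1.3.4. For any posterior distribution $\rho:\Omega\to\mathcal{M}_+^1(\Theta)$, for any real parameters $\beta$ and $\lambda$ such that $0\le\beta<N\bigl[1-\exp\bigl(-\frac{\lambda}{N}\bigr)\bigr]$,

$$\begin{aligned}\mathbb{P}[\rho(R)]&\le\widetilde{\Phi}^{-1}_{\frac{\lambda}{N},\frac{\beta}{\lambda}}\Bigl\{\frac{1}{\lambda-\beta}\int_\beta^\lambda\pi_{\exp(-\gamma R)}(R)\,d\gamma+\frac{\mathbb{P}\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\lambda r)}]\bigr\}}{\lambda-\beta}\Bigr\}\\&\le\frac{1}{N\bigl[1-\exp\bigl(-\frac{\lambda}{N}\bigr)\bigr]-\beta}\Bigl\{\int_\beta^\lambda\pi_{\exp(-\gamma R)}(R)\,d\gamma+\mathbb{P}\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\lambda r)}]\bigr\}\Bigr\}.\end{aligned}$$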



Let us notice in particular that this theorem contains Theorem 1.2.4, which corresponds to the case $\beta=0$. As a corollary, we see also, taking $\rho=\pi_{\exp(-\lambda r)}$ and $\lambda=2\beta$, and noticing that $\gamma\mapsto\pi_{\exp(-\gamma R)}(R)$ is decreasing, that

$$\mathbb{P}\bigl[\pi_{\exp(-\lambda r)}(R)\bigr]\le\inf_{\beta,\ \beta<N[1-\exp(-\frac{\lambda}{N})]}\frac{\lambda-\beta}{N\bigl[1-\exp\bigl(-\frac{\lambda}{N}\bigr)\bigr]-\beta}\,\pi_{\exp(-\beta R)}(R)\le\frac{1}{1-\frac{\lambda}{N}}\,\pi_{\exp(-\frac{\lambda}{2}R)}(R).$$

We can use this inequality in conjunction with the notion of dimension with margin $\eta$ introduced by equation (1.5) to see that the Gibbs posterior achieves for a proper choice of $\lambda$ and any margin parameter $\eta\ge 0$ (which can be chosen to be equal to zero in parametric situations)

$$\inf_\lambda\mathbb{P}\bigl[\pi_{\exp(-\lambda r)}(R)\bigr]\le\operatorname{ess\,inf}_\pi R+\eta+\frac{4d_\eta}{N}+2\sqrt{\frac{2d_\eta\bigl(\operatorname{ess\,inf}_\pi R+\eta\bigr)}{N}+\frac{4d_\eta^2}{N^2}}.\tag{1.8}$$




Deviation bounds to come next will show that the optimal $\lambda$ can be estimated from empirical data.

Let us propose a little numerical example as an illustration: assuming that $d_0=10$, $N=1000$ and $\operatorname{ess\,inf}_\pi R=0.2$, we obtain from equation (1.8) that $\inf_\lambda\mathbb{P}[\pi_{\exp(-\lambda r)}(R)]\le 0.373$.
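The figure quoted in this example can be recovered numerically. The sketch below assumes that equation (1.8) is the closed-form minimum over $\beta\in(0,N/2)$ of $(1-\frac{2\beta}{N})^{-1}\bigl(\operatorname{ess\,inf}_\pi R+\eta+\frac{d_\eta}{\beta}\bigr)$, which is what the preceding corollary gives once $\pi_{\exp(-\beta R)}(R)$ is bounded by means of the dimension $d_\eta$; the constants in `localized_bound` follow from that reading, and the function names are hypothetical.

```python
import math

def localized_bound(ess_inf, eta, dim, n):
    # Closed form: c + 4 d / N + 2 sqrt(2 d c / N + 4 d^2 / N^2), with c = ess inf R + eta
    c = ess_inf + eta
    return c + 4.0 * dim / n + 2.0 * math.sqrt(2.0 * dim * c / n + 4.0 * dim ** 2 / n ** 2)

def direct_minimum(ess_inf, eta, dim, n, steps=100000):
    # Grid minimization of (c + d / beta) / (1 - 2 beta / N) over beta in (0, N / 2)
    c = ess_inf + eta
    best = float("inf")
    for i in range(1, steps):
        beta = 0.5 * n * i / steps
        best = min(best, (c + dim / beta) / (1.0 - 2.0 * beta / n))
    return best

# Values of the example: d_0 = 10, N = 1000, ess inf R = 0.2, eta = 0
print(localized_bound(0.2, 0.0, 10.0, 1000))
print(direct_minimum(0.2, 0.0, 10.0, 1000))
```

Both computations agree to several decimal places and are consistent with the value 0.373 stated above, which supports this reading of the constants in (1.8).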

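1.3.4. Local deviation bounds

When it comes to deviation bounds, for technical reasons we will choose a slightly more involved change of prior distribution and apply Theorem 1.2.6 to the prior $\pi_{\exp[-\beta\Phi_{-\beta/N}\circ R]}$. The advantage of tweaking $R$ with the nonlinear function $\Phi_{-\beta/N}$ will appear in the search for an empirical upper bound of the local entropy term. Theorem 1.1.4, used with the above-mentioned local prior, shows that

$$\mathbb{P}\Bigl\{\exp\Bigl[\sup_{\rho\in\mathcal{M}_+^1(\Theta)}\lambda\bigl\{\rho(\Phi_{\lambda/N}\circ R)-\rho(r)\bigr\}-\mathcal{K}\bigl[\rho,\pi_{\exp[-\beta\Phi_{-\beta/N}\circ R]}\bigr]\Bigr]\Bigr\}\le 1.\tag{1.9}$$

Moreover

$$\mathcal{K}\bigl[\rho,\pi_{\exp[-\beta\Phi_{-\beta/N}\circ R]}\bigr]=\mathcal{K}\bigl[\rho,\pi_{\exp(-\beta r)}\bigr]+\beta\rho\bigl[\Phi_{-\beta/N}\circ R-r\bigr]+\log\bigl\{\pi\bigl[\exp(-\beta\Phi_{-\beta/N}\circ R)\bigr]\bigr\}-\log\bigl\{\pi[\exp(-\beta r)]\bigr\},\tag{1.10}$$

which is an invitation to find an upper bound for $\log\bigl\{\pi\bigl[\exp(-\beta\Phi_{-\beta/N}\circ R)\bigr]\bigr\}-\log\bigl\{\pi[\exp(-\beta r)]\bigr\}$. For conciseness, let us call our localized prior distribution $\overline{\pi}$, thus defined by its density

$$\frac{d\overline{\pi}}{d\pi}(\theta)=\frac{\exp\bigl\{-\beta\Phi_{-\beta/N}[R(\theta)]\bigr\}}{\pi\bigl\{\exp[-\beta\Phi_{-\beta/N}\circ R]\bigr\}}.$$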
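Applying once again Theorem 1.1.4 (page 4), but this time to $-\beta$, we see that

$$\begin{aligned}
&\mathbb{P}\Bigl\{\exp\Bigl[\log\bigl\{\pi[\exp(-\beta \Phi_{-\frac{\beta}{N}} \circ R)]\bigr\} - \log\bigl\{\pi[\exp(-\beta r)]\bigr\}\Bigr]\Bigr\} \\
&\qquad = \mathbb{P}\Bigl\{\exp\Bigl[\log\bigl\{\pi[\exp(-\beta \Phi_{-\frac{\beta}{N}} \circ R)]\bigr\} + \inf_{\rho \in \mathcal{M}_+^1(\Theta)} \beta\rho(r) + \mathcal{K}(\rho, \pi)\Bigr]\Bigr\} \\
&\qquad \le \mathbb{P}\Bigl\{\exp\Bigl[\log\bigl\{\pi[\exp(-\beta \Phi_{-\frac{\beta}{N}} \circ R)]\bigr\} + \beta\bar\pi(r) + \mathcal{K}(\bar\pi, \pi)\Bigr]\Bigr\} \\
&\qquad = \mathbb{P}\Bigl\{\exp\Bigl[\beta\bigl\{\bar\pi(r) - \bar\pi(\Phi_{-\frac{\beta}{N}} \circ R)\bigr\}\Bigr]\Bigr\} \le 1.
\end{aligned} \tag{1.11}$$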



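Combining equations (1.10) and (1.11) and using the concavity of $\Phi_{-\frac{\beta}{N}}$, we see that
with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$0 \le \mathcal{K}(\rho, \bar\pi) \le \mathcal{K}\bigl[\rho, \pi_{\exp(-\beta r)}\bigr] + \beta\bigl\{\Phi_{-\frac{\beta}{N}}[\rho(R)] - \rho(r)\bigr\} - \log(\epsilon).$$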

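We have proved a lower deviation bound:

Theorem 1.3.5. For any positive real constant $\beta$, with $\mathbb{P}$ probability at least $1 - \epsilon$,
for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\frac{\exp\Bigl\{\dfrac{\beta}{N}\Bigl[\rho(r) - \dfrac{\mathcal{K}[\rho, \pi_{\exp(-\beta r)}] - \log(\epsilon)}{\beta}\Bigr]\Bigr\} - 1}{\exp\bigl(\frac{\beta}{N}\bigr) - 1} \le \rho(R).$$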
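
To get a feeling for the size of this lower bound, here is a minimal Python sketch. It assumes the bound of Theorem 1.3.5 in the form obtained by inverting $\Phi_{-\beta/N}$ in the inequality preceding the theorem; the function name and the numerical inputs (empirical risk 0.2, zero divergence, $\beta = 100$, $N = 1000$, $\epsilon = 0.01$) are illustrative choices of ours, not values taken from the text.

```python
import math

def lower_deviation_bound(rho_r, kl, beta, n, eps):
    """Lower bound on rho(R), assuming the form
    (exp{(beta/n) [rho(r) - (kl - log eps)/beta]} - 1) / (exp(beta/n) - 1),
    where kl stands for K[rho, pi_exp(-beta r)]."""
    inner = rho_r - (kl - math.log(eps)) / beta
    return math.expm1(beta / n * inner) / math.expm1(beta / n)

# rho taken equal to the Gibbs posterior pi_exp(-beta r), so that kl = 0.
lb = lower_deviation_bound(rho_r=0.2, kl=0.0, beta=100.0, n=1000, eps=0.01)
print(lb)
```

As expected, the bound sits below the empirical risk, and it decreases when the divergence term or the confidence level increases.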



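We can also obtain a lower deviation bound for $\theta$. Indeed equation (1.11) can
also be written as

$$\mathbb{P}\Bigl\{\pi_{\exp(-\beta r)}\Bigl[\exp\bigl\{\beta\bigl(r - \Phi_{-\frac{\beta}{N}} \circ R\bigr)\bigr\}\Bigr]\Bigr\} \le 1.$$

This means that for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\mathbb{P}\Bigl\{\rho\Bigl[\exp\Bigl\{\beta\bigl[r - \Phi_{-\frac{\beta}{N}} \circ R\bigr] - \log\Bigl(\frac{d\rho}{d\pi_{\exp(-\beta r)}}\Bigr)\Bigr\}\Bigr]\Bigr\} \le 1.$$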



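We have proved

Theorem 1.3.6. For any positive real constant $\beta$, for any posterior distribution
$\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, with $\mathbb{P}\rho$ probability at least $1 - \epsilon$,

$$R(\theta) \ge \Phi_{-\frac{\beta}{N}}^{-1}\Bigl\{r(\theta) - \beta^{-1}\Bigl[\log\Bigl(\frac{d\rho}{d\pi_{\exp(-\beta r)}}(\theta)\Bigr) - \log(\epsilon)\Bigr]\Bigr\}
= \frac{\exp\Bigl\{\dfrac{\beta}{N}\Bigl[r(\theta) - \beta^{-1}\Bigl[\log\Bigl(\dfrac{d\rho}{d\pi_{\exp(-\beta r)}}(\theta)\Bigr) - \log(\epsilon)\Bigr]\Bigr]\Bigr\} - 1}{\exp\bigl(\frac{\beta}{N}\bigr) - 1}.$$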

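Let us now resume our investigation of the upper deviations of $\rho(R)$. Using the
Cauchy-Schwarz inequality to combine equations (1.9, page 19) and (1.11, page 19),
we obtain

$$\begin{aligned}
&\mathbb{P}\Bigl\{\exp\Bigl[\frac{1}{2} \sup_{\rho \in \mathcal{M}_+^1(\Theta)} \lambda\rho(\Phi_{\frac{\lambda}{N}} \circ R) - \beta\rho(\Phi_{-\frac{\beta}{N}} \circ R) - (\lambda - \beta)\rho(r) - \mathcal{K}\bigl[\rho, \pi_{\exp(-\beta r)}\bigr]\Bigr]\Bigr\} \\
&\quad = \mathbb{P}\Bigl\{\exp\Bigl[\frac{1}{2} \sup_{\rho \in \mathcal{M}_+^1(\Theta)} \lambda\bigl\{\rho(\Phi_{\frac{\lambda}{N}} \circ R) - \rho(r)\bigr\} - \mathcal{K}(\rho, \bar\pi)\Bigr] \\
&\qquad\qquad \times \exp\Bigl[\frac{1}{2}\Bigl(\log\bigl\{\pi[\exp(-\beta \Phi_{-\frac{\beta}{N}} \circ R)]\bigr\} - \log\bigl\{\pi[\exp(-\beta r)]\bigr\}\Bigr)\Bigr]\Bigr\} \\
&\quad \le \mathbb{P}\Bigl\{\exp\Bigl[\sup_{\rho \in \mathcal{M}_+^1(\Theta)} \lambda\bigl\{\rho(\Phi_{\frac{\lambda}{N}} \circ R) - \rho(r)\bigr\} - \mathcal{K}(\rho, \bar\pi)\Bigr]\Bigr\}^{1/2} \\
&\qquad\qquad \times \mathbb{P}\Bigl\{\exp\Bigl[\log\bigl\{\pi[\exp(-\beta \Phi_{-\frac{\beta}{N}} \circ R)]\bigr\} - \log\bigl\{\pi[\exp(-\beta r)]\bigr\}\Bigr]\Bigr\}^{1/2} \le 1.
\end{aligned} \tag{1.12}$$

Thus with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution $\rho$,

$$\begin{aligned}
\lambda\Phi_{\frac{\lambda}{N}}[\rho(R)] - \beta\Phi_{-\frac{\beta}{N}}[\rho(R)]
&\le \lambda\rho(\Phi_{\frac{\lambda}{N}} \circ R) - \beta\rho(\Phi_{-\frac{\beta}{N}} \circ R) \\
&\le (\lambda - \beta)\rho(r) + \mathcal{K}\bigl(\rho, \pi_{\exp(-\beta r)}\bigr) - 2\log(\epsilon).
\end{aligned}$$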

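(It would have been more straightforward to use a union bound on deviation
inequalities instead of the Cauchy-Schwarz inequality on exponential moments;
however, this would have led to replacing $-2\log(\epsilon)$ with the worse factor $2\log(\frac{2}{\epsilon})$.) Let
us now recall that

$$\lambda\Phi_{\frac{\lambda}{N}}(p) - \beta\Phi_{-\frac{\beta}{N}}(p) = -N\log\bigl\{1 - \bigl[1 - \exp(-\tfrac{\lambda}{N})\bigr]p\bigr\} - N\log\bigl\{1 + \bigl[\exp(\tfrac{\beta}{N}) - 1\bigr]p\bigr\},$$

and let us put

$$\begin{aligned}
B &= (\lambda - \beta)\rho(r) + \mathcal{K}\bigl[\rho, \pi_{\exp(-\beta r)}\bigr] - 2\log(\epsilon) \\
&= \mathcal{K}\bigl[\rho, \pi_{\exp(-\lambda r)}\bigr] + \int_\beta^\lambda \pi_{\exp(-\xi r)}(r)\,d\xi - 2\log(\epsilon).
\end{aligned}$$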
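
The equality between the two expressions of $B$ rests on the identity $\int_\beta^\lambda \pi_{\exp(-\xi r)}(r)\,d\xi = \log\{\pi[\exp(-\beta r)]\} - \log\{\pi[\exp(-\lambda r)]\}$. The following sketch checks it numerically on a toy finite parameter set; the four-point prior, the empirical risks and the posterior $\rho$ are arbitrary values chosen by us for illustration, and the common term $-2\log(\epsilon)$ is left out of both sides.

```python
import math

def gibbs(prior, risks, inv_temp):
    """Gibbs measure pi_exp(-inv_temp * r) on a finite set, with its partition function."""
    weights = [p * math.exp(-inv_temp * r) for p, r in zip(prior, risks)]
    z = sum(weights)
    return [w / z for w in weights], z

def kl(rho, pi):
    """Kullback divergence K(rho, pi) between two finite distributions."""
    return sum(a * math.log(a / b) for a, b in zip(rho, pi) if a > 0.0)

prior = [0.25, 0.25, 0.25, 0.25]   # toy prior pi (assumption)
risks = [0.10, 0.20, 0.35, 0.50]   # toy empirical risks r(theta) (assumption)
rho = [0.4, 0.3, 0.2, 0.1]         # arbitrary posterior rho (assumption)
beta, lam = 50.0, 100.0

pi_beta, z_beta = gibbs(prior, risks, beta)
pi_lam, z_lam = gibbs(prior, risks, lam)
rho_r = sum(a * r for a, r in zip(rho, risks))

# Integral of pi_exp(-xi r)(r) over [beta, lam], computed from the partition functions.
integral = math.log(z_beta) - math.log(z_lam)

lhs = (lam - beta) * rho_r + kl(rho, pi_beta)
rhs = kl(rho, pi_lam) + integral
print(lhs, rhs)
```

Both sides coincide up to rounding errors, whatever the choice of $\rho$.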

Let us consider moreover the change of variables $\alpha = 1 - \exp(-\frac{\lambda}{N})$ and $\gamma = \exp(\frac{\beta}{N}) - 1$.
We obtain $[1 - \alpha\rho(R)][1 + \gamma\rho(R)] \ge \exp(-\frac{B}{N})$, leading to

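Theorem 1.3.7. For any positive constants $\alpha$, $\gamma$, such that $0 \le \gamma < \alpha < 1$, with $\mathbb{P}$
probability at least $1 - \epsilon$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, the bound

$$\begin{aligned}
M(\rho) &= -\frac{\log[(1 - \alpha)(1 + \gamma)]}{\alpha - \gamma}\,\rho(r) + \frac{\mathcal{K}\bigl(\rho, \pi_{\exp[-N\log(1 + \gamma) r]}\bigr) - 2\log(\epsilon)}{N(\alpha - \gamma)} \\
&= \frac{\mathcal{K}\bigl[\rho, \pi_{\exp[N\log(1 - \alpha) r]}\bigr] + \displaystyle\int_{N\log(1 + \gamma)}^{-N\log(1 - \alpha)} \pi_{\exp(-\xi r)}(r)\,d\xi - 2\log(\epsilon)}{N(\alpha - \gamma)}
\end{aligned}$$

is such that

$$\rho(R) \le \frac{\alpha - \gamma}{2\alpha\gamma}\Bigl(\sqrt{1 + \frac{4\alpha\gamma}{(\alpha - \gamma)^2}\bigl\{1 - \exp[-(\alpha - \gamma)M(\rho)]\bigr\}} - 1\Bigr) \le M(\rho).$$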
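
The non-linear bound is the positive root of the quadratic inequality $[1 - \alpha x][1 + \gamma x] \ge \exp[-(\alpha - \gamma)M]$. A small Python sketch, under the assumption that the bound has the square-root form stated above (and restricted to $\gamma > 0$ to avoid a division by zero), makes it easy to check that it never exceeds the linear bound $M$; the function name is ours.

```python
import math

def nonlinear_bound(m, alpha, gamma):
    """Positive root x of (1 - alpha x)(1 + gamma x) = exp(-(alpha - gamma) m)."""
    if not 0.0 < gamma < alpha < 1.0:
        raise ValueError("this sketch requires 0 < gamma < alpha < 1")
    c = 1.0 - math.exp(-(alpha - gamma) * m)
    scale = (alpha - gamma) / (2.0 * alpha * gamma)
    return scale * (math.sqrt(1.0 + 4.0 * alpha * gamma * c / (alpha - gamma) ** 2) - 1.0)

alpha, gamma = 0.5, 0.1
checks = []
for m in (0.05, 0.2, 0.5):
    x = nonlinear_bound(m, alpha, gamma)
    residual = (1.0 - alpha * x) * (1.0 + gamma * x) - math.exp(-(alpha - gamma) * m)
    checks.append((m, x, residual))
print(checks)
```

The residuals vanish up to rounding, and the root stays below $M$ because $(1 - \alpha x)(1 + \gamma x) \le \exp[-(\alpha - \gamma)x]$.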



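Let us now give an upper bound for $R(\theta)$. Equation (1.12, page 20) can also be
written as

$$\mathbb{P}\Bigl\{\Bigl[\pi_{\exp(-\beta r)}\Bigl\{\exp\bigl[\lambda\Phi_{\frac{\lambda}{N}} \circ R - \beta\Phi_{-\frac{\beta}{N}} \circ R - (\lambda - \beta)r\bigr]\Bigr\}\Bigr]^{1/2}\Bigr\} \le 1.$$

This means that for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\mathbb{P}\Bigl\{\Bigl[\rho\Bigl\{\exp\Bigl[\lambda\Phi_{\frac{\lambda}{N}} \circ R - \beta\Phi_{-\frac{\beta}{N}} \circ R - (\lambda - \beta)r - \log\Bigl(\frac{d\rho}{d\pi_{\exp(-\beta r)}}\Bigr)\Bigr]\Bigr\}\Bigr]^{1/2}\Bigr\} \le 1.$$

Using the concavity of the square root function, this inequality can be weakened
to

$$\mathbb{P}\Bigl\{\rho\Bigl[\exp\Bigl\{\frac{1}{2}\Bigl[\lambda\Phi_{\frac{\lambda}{N}} \circ R - \beta\Phi_{-\frac{\beta}{N}} \circ R - (\lambda - \beta)r - \log\Bigl(\frac{d\rho}{d\pi_{\exp(-\beta r)}}\Bigr)\Bigr]\Bigr\}\Bigr]\Bigr\} \le 1.$$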



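We have proved

Theorem 1.3.8. For any positive real constants $\lambda$ and $\beta$ and for any posterior
distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, with $\mathbb{P}\rho$ probability at least $1 - \epsilon$,

$$\lambda\Phi_{\frac{\lambda}{N}}[R(\theta)] - \beta\Phi_{-\frac{\beta}{N}}[R(\theta)] \le (\lambda - \beta)r(\theta) + \log\Bigl[\frac{d\rho}{d\pi_{\exp(-\beta r)}}(\theta)\Bigr] - 2\log(\epsilon).$$

Putting $\alpha = 1 - \exp(-\frac{\lambda}{N})$, $\gamma = \exp(\frac{\beta}{N}) - 1$ and

$$\begin{aligned}
M(\theta) &= -\frac{\log[(1 - \alpha)(1 + \gamma)]}{\alpha - \gamma}\,r(\theta) + \frac{\log\Bigl[\dfrac{d\rho}{d\pi_{\exp[-N\log(1 + \gamma) r]}}(\theta)\Bigr] - 2\log(\epsilon)}{N(\alpha - \gamma)} \\
&= \frac{\log\Bigl[\dfrac{d\rho}{d\pi_{\exp[N\log(1 - \alpha) r]}}(\theta)\Bigr] + \displaystyle\int_{N\log(1 + \gamma)}^{-N\log(1 - \alpha)} \pi_{\exp(-\xi r)}(r)\,d\xi - 2\log(\epsilon)}{N(\alpha - \gamma)},
\end{aligned}$$

we can also, in the case when $\gamma < \alpha$, write this inequality as

$$R(\theta) \le \frac{\alpha - \gamma}{2\alpha\gamma}\Bigl(\sqrt{1 + \frac{4\alpha\gamma}{(\alpha - \gamma)^2}\bigl\{1 - \exp[-(\alpha - \gamma)M(\theta)]\bigr\}} - 1\Bigr) \le M(\theta).$$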



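It may be enlightening to introduce the empirical dimension $d_e$ defined by
equation (1.6) on page 13. It provides the upper bound

$$\int_\beta^\lambda \pi_{\exp(-\xi r)}(r)\,d\xi \le (\lambda - \beta)\operatorname{ess\,inf}_{\pi} r + d_e\log\Bigl(\frac{\lambda}{\beta}\Bigr),$$

which shows that in Theorem 1.3.7 (page 21),

$$M(\rho) \le \frac{\log[(1 + \gamma)(1 - \alpha)]}{\gamma - \alpha}\operatorname{ess\,inf}_{\pi} r + \frac{d_e\log\Bigl(\dfrac{-\log(1 - \alpha)}{\log(1 + \gamma)}\Bigr) + \mathcal{K}\bigl[\rho, \pi_{\exp[N\log(1 - \alpha) r]}\bigr] - 2\log(\epsilon)}{N(\alpha - \gamma)}.$$

Similarly, in Theorem 1.3.8 above,

$$M(\theta) \le \frac{\log[(1 + \gamma)(1 - \alpha)]}{\gamma - \alpha}\operatorname{ess\,inf}_{\pi} r + \frac{d_e\log\Bigl(\dfrac{-\log(1 - \alpha)}{\log(1 + \gamma)}\Bigr) + \log\Bigl[\dfrac{d\rho}{d\pi_{\exp[N\log(1 - \alpha) r]}}(\theta)\Bigr] - 2\log(\epsilon)}{N(\alpha - \gamma)}.$$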

Let us give a little numerical illustration: assuming that $d_e = 10$, $N = 1000$, and
$\operatorname{ess\,inf}_{\pi} r = 0.2$, taking $\epsilon = 0.01$, $\alpha = 0.5$ and $\gamma = 0.1$, we obtain from Theorem
1.3.7 $\pi_{\exp[N\log(1 - \alpha) r]}(R) \simeq \pi_{\exp(-693 r)}(R) \le 0.332 \le 0.372$, where we have given
respectively the non-linear and the linear bound. This shows the practical interest
of keeping the non-linearity. Optimizing the values of the parameters $\alpha$ and $\gamma$ would
not have yielded a significantly lower bound.
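
These two figures can be recomputed directly. The sketch below assumes the linear bound in the form $M \le \frac{\log[(1+\gamma)(1-\alpha)]}{\gamma - \alpha}\operatorname{ess\,inf}_\pi r + \frac{d_e \log(-\log(1-\alpha)/\log(1+\gamma)) - 2\log(\epsilon)}{N(\alpha - \gamma)}$ (the divergence term vanishes for the Gibbs posterior), and the non-linear bound in its square-root form; the function names are ours.

```python
import math

def linear_bound(r_star, d_e, n, alpha, gamma, eps, kl=0.0):
    """Upper bound on M(rho) written with the empirical dimension d_e (assumed form)."""
    slope = -math.log((1.0 - alpha) * (1.0 + gamma)) / (alpha - gamma)
    log_ratio = math.log(-math.log(1.0 - alpha) / math.log(1.0 + gamma))
    return slope * r_star + (d_e * log_ratio + kl - 2.0 * math.log(eps)) / (n * (alpha - gamma))

def nonlinear_bound(m, alpha, gamma):
    """Positive root x of (1 - alpha x)(1 + gamma x) = exp(-(alpha - gamma) m)."""
    c = 1.0 - math.exp(-(alpha - gamma) * m)
    scale = (alpha - gamma) / (2.0 * alpha * gamma)
    return scale * (math.sqrt(1.0 + 4.0 * alpha * gamma * c / (alpha - gamma) ** 2) - 1.0)

m_lin = linear_bound(r_star=0.2, d_e=10, n=1000, alpha=0.5, gamma=0.1, eps=0.01)
m_nonlin = nonlinear_bound(m_lin, alpha=0.5, gamma=0.1)
print(round(m_lin, 3), round(m_nonlin, 3))
```

The computed values agree with the linear bound 0.372 and the non-linear bound 0.332 quoted in the text.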

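The following corollary is obtained by taking $\lambda = 2\beta$ and keeping only the linear
bound; we give it for the sake of its simplicity:

Corollary 1.3.9. For any positive real constant $\beta$ such that $\exp(\frac{\beta}{N}) + \exp(-\frac{2\beta}{N}) < 2$,
which is the case when $\beta < 0.48\,N$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any
posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\begin{aligned}
\rho(R) &\le \frac{\beta\rho(r) + \mathcal{K}\bigl[\rho, \pi_{\exp(-\beta r)}\bigr] - 2\log(\epsilon)}{N\bigl[2 - \exp(\frac{\beta}{N}) - \exp(-\frac{2\beta}{N})\bigr]} \\
&= \frac{\displaystyle\int_\beta^{2\beta} \pi_{\exp(-\xi r)}(r)\,d\xi + \mathcal{K}\bigl[\rho, \pi_{\exp(-2\beta r)}\bigr] - 2\log(\epsilon)}{N\bigl[2 - \exp(\frac{\beta}{N}) - \exp(-\frac{2\beta}{N})\bigr]}.
\end{aligned}$$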
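
The threshold $0.48\,N$ can be recovered numerically: writing $x = \beta/N$, the condition reads $\exp(x) + \exp(-2x) < 2$. The following bisection is a sketch using only the standard library; the bracketing interval $[0.1, 1]$ is our choice.

```python
import math

def excess(x):
    """exp(x) + exp(-2x) - 2, negative exactly when the corollary applies (x = beta / N)."""
    return math.exp(x) + math.exp(-2.0 * x) - 2.0

# excess(0.1) < 0 < excess(1.0), so the positive root lies in between.
low, high = 0.1, 1.0
for _ in range(60):
    mid = 0.5 * (low + high)
    if excess(mid) < 0.0:
        low = mid
    else:
        high = mid
threshold = 0.5 * (low + high)
print(threshold)
```

The root lies slightly above 0.48, which is consistent with the sufficient condition $\beta < 0.48\,N$.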



Let us mention that this corollary applied to the above numerical example gives
$\pi_{\exp(-200 r)}(R) \le 0.475$ (when we take $\beta = 100$, consistently with the choice
$\gamma = 0.1$).
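
Here again the figure is easy to reproduce. The sketch assumes the bound of Corollary 1.3.9 with $\rho = \pi_{\exp(-2\beta r)}$ (so that the divergence term vanishes) and the empirical dimension bound $\int_\beta^{2\beta} \pi_{\exp(-\xi r)}(r)\,d\xi \le \beta \operatorname{ess\,inf}_\pi r + d_e \log(2)$; variable names are ours.

```python
import math

beta, n = 100.0, 1000
d_e, r_star, eps = 10, 0.2, 0.01

# Numerator: integral bound + zero divergence - 2 log(eps).
numerator = beta * r_star + d_e * math.log(2.0) - 2.0 * math.log(eps)
# Denominator: N [2 - exp(beta/N) - exp(-2 beta/N)].
denominator = n * (2.0 - math.exp(beta / n) - math.exp(-2.0 * beta / n))
corollary_bound = numerator / denominator
print(round(corollary_bound, 3))
```

The result agrees with the value 0.475 quoted in the text, noticeably weaker than the bounds obtained with $\alpha = 0.5$.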

1.3.5. Partially local bounds 

Local bounds are suitable when the lowest values of the empirical error rate $r$ are
reached only on a small part of the parameter set $\Theta$. When $\Theta$ is the disjoint union
of sub-models of different complexities, the minimum of $r$ will as a rule not be
"localized" in a way that calls for the use of local bounds. Just think for instance
of the case when $\Theta = \bigsqcup_{m=1}^{M} \Theta_m$, where the sets $\Theta_1 \subset \Theta_2 \subset \dots \subset \Theta_M$ are nested.
In this case we will have $\inf_{\Theta_1} r \ge \inf_{\Theta_2} r \ge \dots \ge \inf_{\Theta_M} r$, although $\Theta_M$ may
be too large to be the right model to use. In this situation, we do not want to
localize the bound completely. Let us make a more specific fanciful but typical
pseudo computation. Just imagine we have a countable collection $(\Theta_m)_{m \in M}$ of
sub-models. Let us assume we are interested in choosing between the estimators
$\widehat\theta_m \in \arg\min_{\Theta_m} r$, maybe randomizing them (e.g. replacing them with $\pi^m_{\exp(-\lambda r)}$).
Let us imagine moreover that we are in a typically parametric situation, where,
for some priors $\pi^m \in \mathcal{M}_+^1(\Theta_m)$, $m \in M$, there is a "dimension" $d_m$ such that
$\lambda[\pi^m_{\exp(-\lambda r)}(r) - r(\widehat\theta_m)] \simeq d_m$. Let $\mu \in \mathcal{M}_+^1(M)$ be some distribution on the index
set $M$. It is easy to see that $(\mu\pi)_{\exp(-\lambda r)}$ will typically not be properly local, in the
sense that typically

M{7i"ex P (-Ar)W7r[exp(-Ar)] } 

(A"r)cxp(-Ar)M = 7— 

]T [Qnf r) + fy] exp[-A(inf r) - d m log(£)] a(m) 

m£M 



m£M 



^2 ex P - X (™fr)-d m \og(j±) M 



TO 



Jnf w (mfr) + ^log(£)-Ilog[ M (m)] 

+ iogj exp[-d m iog(g^)]M m ) >■ 



where we have used the approximations 

- log|7r[exp(-Ar)] j = J 7r cxp( „ (3r) (r)d/3 
Ainf r) + A 1] d/3 ~ A(inf r) + d m [log(£) + l] , 

JO H» »rr> 



and LjHgi fe(m)]^(m) ^ h{m) _ XogHm)l v £ M i (M)j taking!y(m) = 
Z^m exp[— /i(m)Ji/(m) m 
jLt(m) exp[-<2 m log ( ^-)] 



Em' /*("*') exp[-d ro / Iog(^7)] ' 

These approximations have no pretension to be rigorous or very accurate, but
they nevertheless give the best order of magnitude we can expect in typical
situations, and show that this order of magnitude is not what we are looking for: mixing
different models with the help of $\mu$ spoils the localization, introducing a multiplier
$\log(\frac{e\lambda}{d_m})$ to the dimension $d_m$ which is precisely what we would have got if we had
not localized the bound at all. What we would really like to do in such situations is
to use a partially localized posterior distribution, such as $\pi^{\widehat m}_{\exp(-\lambda r)}$, where $\widehat m$ is an
estimator of the best sub-model to be used. While the most straightforward way to
do this is to use a union bound on results obtained for each sub-model $\Theta_m$, here
we are going to show how to allow arbitrary posterior distributions on the index
set (corresponding to a randomization of the choice of $\widehat m$).

Let us consider the framework we just mentioned: let the measurable parameter
set $(\Theta, \mathcal{T})$ be a union of measurable sub-models, $\Theta = \bigcup_{m \in M} \Theta_m$. Let the index set
$(M, \mathcal{M})$ be some measurable space (most of the time it will be a countable set). Let
$\mu \in \mathcal{M}_+^1(M)$ be a prior probability distribution on $(M, \mathcal{M})$. Let $\pi : M \to \mathcal{M}_+^1(\Theta)$ be
a regular conditional probability measure such that $\pi(m, \Theta_m) = 1$, for any $m \in M$.
Let $\mu\pi \in \mathcal{M}_+^1(M \times \Theta)$ be the product probability measure defined for any bounded
measurable function $h : M \times \Theta \to \mathbb{R}$ by

$$\mu\pi(h) = \int_{m \in M}\Bigl(\int_{\theta \in \Theta} h(m, \theta)\,\pi(m, d\theta)\Bigr)\mu(dm).$$

For any bounded measurable function $h : \Omega \times M \times \Theta \to \mathbb{R}$, let $\pi_{\exp(h)} : \Omega \times M \to \mathcal{M}_+^1(\Theta)$
be the regular conditional posterior probability measure defined by

$$\frac{d\pi_{\exp(h)}}{d\pi}(m, \theta) = \frac{\exp[h(m, \theta)]}{\pi[m, \exp(h)]},$$

where consistently with previous notation $\pi(m, h) = \int_\Theta h(m, \theta)\,\pi(m, d\theta)$ (we will
also often use the less explicit notation $\pi(h)$). For short, let

$$U(\theta, \omega) = \lambda\Phi_{\frac{\lambda}{N}}[R(\theta)] - \beta\Phi_{-\frac{\beta}{N}}[R(\theta)] - (\lambda - \beta)r(\theta, \omega).$$

Integrating with respect to $\mu$ equation (1.12, page 20), written in each sub-model
$\Theta_m$ using the prior distribution $\pi(m, \cdot)$, we see that



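$$\begin{aligned}
&\mathbb{P}\Bigl\{\exp\Bigl[\sup_{\nu \in \mathcal{M}_+^1(M)}\ \sup_{\rho : M \to \mathcal{M}_+^1(\Theta)} \frac{1}{2}\Bigl[(\nu\rho)(U) - \nu\bigl\{\mathcal{K}[\rho, \pi_{\exp(-\beta r)}]\bigr\}\Bigr] - \mathcal{K}(\nu, \mu)\Bigr]\Bigr\} \\
&\qquad \le \mathbb{P}\Bigl\{\exp\Bigl[\sup_{\nu \in \mathcal{M}_+^1(M)} \frac{1}{2}\,\nu\Bigl(\sup_{\rho \in \mathcal{M}_+^1(\Theta)} \rho(U) - \mathcal{K}(\rho, \pi_{\exp(-\beta r)})\Bigr) - \mathcal{K}(\nu, \mu)\Bigr]\Bigr\} \\
&\qquad = \mathbb{P}\Bigl\{\mu\Bigl[\exp\Bigl\{\frac{1}{2}\sup_{\rho \in \mathcal{M}_+^1(\Theta)}\bigl[\rho(U) - \mathcal{K}[\rho, \pi_{\exp(-\beta r)}]\bigr]\Bigr\}\Bigr]\Bigr\} \\
&\qquad = \mu\Bigl\{\mathbb{P}\Bigl[\exp\Bigl\{\frac{1}{2}\sup_{\rho \in \mathcal{M}_+^1(\Theta)}\bigl[\rho(U) - \mathcal{K}[\rho, \pi_{\exp(-\beta r)}]\bigr]\Bigr\}\Bigr]\Bigr\} \le 1.
\end{aligned}$$

This proves that

$$\mathbb{P}\Bigl\{\exp\Bigl[\frac{1}{2}\sup_{\nu \in \mathcal{M}_+^1(M)}\ \sup_{\rho : M \to \mathcal{M}_+^1(\Theta)} \nu\rho\bigl[\lambda\Phi_{\frac{\lambda}{N}}(R) - \beta\Phi_{-\frac{\beta}{N}}(R)\bigr] - (\lambda - \beta)\nu\rho(r) - 2\mathcal{K}(\nu, \mu) - \nu\bigl\{\mathcal{K}[\rho, \pi_{\exp(-\beta r)}]\bigr\}\Bigr]\Bigr\} \le 1. \tag{1.13}$$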



Introducing the optimal value of $r$ on each sub-model $r^\star(m) = \operatorname{ess\,inf}_{\pi(m, \cdot)} r$ and
the empirical dimensions

$$d_e(m) = \sup_{\xi \in \mathbb{R}_+} \xi\bigl[\pi_{\exp(-\xi r)}(m, r) - r^\star(m)\bigr],$$

we can thus state

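Theorem 1.3.10. For any positive real constants $\beta < \lambda$, with $\mathbb{P}$ probability at least
$1 - \epsilon$, for any posterior distribution $\nu : \Omega \to \mathcal{M}_+^1(M)$, for any conditional posterior
distribution $\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,

$$\lambda\Phi_{\frac{\lambda}{N}}[\nu\rho(R)] - \beta\Phi_{-\frac{\beta}{N}}[\nu\rho(R)] \le \nu\rho\bigl[\lambda\Phi_{\frac{\lambda}{N}}(R) - \beta\Phi_{-\frac{\beta}{N}}(R)\bigr] \le B_1(\nu, \rho),$$

where

$$\begin{aligned}
B_1(\nu, \rho) &= (\lambda - \beta)\nu\rho(r) + 2\mathcal{K}(\nu, \mu) + \nu\bigl\{\mathcal{K}[\rho, \pi_{\exp(-\beta r)}]\bigr\} - 2\log(\epsilon) \\
&= \nu\Bigl[\int_\beta^\lambda \pi_{\exp(-\alpha r)}(r)\,d\alpha\Bigr] + 2\mathcal{K}(\nu, \mu) + \nu\bigl\{\mathcal{K}[\rho, \pi_{\exp(-\lambda r)}]\bigr\} - 2\log(\epsilon) \\
&= -2\log\Bigl\{\mu\Bigl[\exp\Bigl(-\frac{1}{2}\int_\beta^\lambda \pi_{\exp(-\alpha r)}(r)\,d\alpha\Bigr)\Bigr]\Bigr\} + 2\mathcal{K}\Bigl[\nu, \mu_{\bigl(\frac{\pi[\exp(-\lambda r)]}{\pi[\exp(-\beta r)]}\bigr)^{1/2}}\Bigr] + \nu\bigl\{\mathcal{K}[\rho, \pi_{\exp(-\lambda r)}]\bigr\} - 2\log(\epsilon),
\end{aligned}$$

and therefore

$$B_1(\nu, \rho) \le \nu\Bigl[(\lambda - \beta)r^\star + \log\Bigl(\frac{\lambda}{\beta}\Bigr)d_e\Bigr] + 2\mathcal{K}(\nu, \mu) + \nu\bigl\{\mathcal{K}[\rho, \pi_{\exp(-\lambda r)}]\bigr\} - 2\log(\epsilon),$$

as well as

$$B_1(\nu, \rho) \le -2\log\Bigl\{\mu\Bigl[\exp\Bigl(-\frac{\lambda - \beta}{2}r^\star - \frac{1}{2}\log\Bigl(\frac{\lambda}{\beta}\Bigr)d_e\Bigr)\Bigr]\Bigr\} + 2\mathcal{K}\Bigl[\nu, \mu_{\bigl(\frac{\pi[\exp(-\lambda r)]}{\pi[\exp(-\beta r)]}\bigr)^{1/2}}\Bigr] + \nu\bigl\{\mathcal{K}[\rho, \pi_{\exp(-\lambda r)}]\bigr\} - 2\log(\epsilon).$$

Thus, for any real constants $\alpha$ and $\gamma$ such that $0 \le \gamma < \alpha < 1$, with $\mathbb{P}$ probability
at least $1 - \epsilon$, for any posterior distribution $\nu : \Omega \to \mathcal{M}_+^1(M)$ and any conditional
posterior distribution $\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$, the bound

$$\begin{aligned}
B_2(\nu, \rho) &= -\frac{\log[(1 - \alpha)(1 + \gamma)]}{\alpha - \gamma}\,\nu\rho(r) + \frac{2\mathcal{K}(\nu, \mu) + \nu\bigl\{\mathcal{K}[\rho, \pi_{(1 + \gamma)^{-Nr}}]\bigr\} - 2\log(\epsilon)}{N(\alpha - \gamma)} \\
&= \frac{2\mathcal{K}\Bigl[\nu, \mu_{\bigl(\frac{\pi[(1 - \alpha)^{Nr}]}{\pi[(1 + \gamma)^{-Nr}]}\bigr)^{1/2}}\Bigr] + \nu\bigl\{\mathcal{K}[\rho, \pi_{(1 - \alpha)^{Nr}}]\bigr\}}{N(\alpha - \gamma)} \\
&\qquad - \frac{2}{N(\alpha - \gamma)}\log\Bigl\{\mu\Bigl[\exp\Bigl(-\frac{1}{2}\int_{N\log(1 + \gamma)}^{-N\log(1 - \alpha)} \pi_{\exp(-\xi r)}(\cdot, r)\,d\xi\Bigr)\Bigr]\Bigr\} - \frac{2\log(\epsilon)}{N(\alpha - \gamma)}
\end{aligned}$$

satisfies

$$\nu\rho(R) \le \frac{\alpha - \gamma}{2\alpha\gamma}\Bigl(\sqrt{1 + \frac{4\alpha\gamma}{(\alpha - \gamma)^2}\bigl\{1 - \exp[-(\alpha - \gamma)B_2(\nu, \rho)]\bigr\}} - 1\Bigr) \le B_2(\nu, \rho).$$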



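If one is willing to bound the deviations with respect to $\mathbb{P}\nu\rho$, it is enough to
remark that the equation preceding equation (1.13, page 25) can also be written as

$$\mathbb{P}\Bigl\{\mu\Bigl[\Bigl(\pi_{\exp(-\beta r)}\Bigl\{\exp\bigl[\lambda\Phi_{\frac{\lambda}{N}} \circ R - \beta\Phi_{-\frac{\beta}{N}} \circ R - (\lambda - \beta)r\bigr]\Bigr\}\Bigr)^{1/2}\Bigr]\Bigr\} \le 1.$$

Thus for any posterior distributions $\nu : \Omega \to \mathcal{M}_+^1(M)$ and $\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,

$$\mathbb{P}\Bigl\{\nu\Bigl[\Bigl(\rho\Bigl\{\exp\Bigl[\lambda\Phi_{\frac{\lambda}{N}} \circ R - \beta\Phi_{-\frac{\beta}{N}} \circ R - (\lambda - \beta)r - 2\log\Bigl(\frac{d\nu}{d\mu}\Bigr) - \log\Bigl(\frac{d\rho}{d\pi_{\exp(-\beta r)}}\Bigr)\Bigr]\Bigr\}\Bigr)^{1/2}\Bigr]\Bigr\} \le 1.$$

Using the concavity of the square root function to pull the integration with respect
to $\rho$ out of the square root, we get

$$\mathbb{P}\nu\rho\Bigl\{\exp\Bigl[\frac{1}{2}\Bigl\{\lambda\Phi_{\frac{\lambda}{N}} \circ R - \beta\Phi_{-\frac{\beta}{N}} \circ R - (\lambda - \beta)r - 2\log\Bigl(\frac{d\nu}{d\mu}\Bigr) - \log\Bigl(\frac{d\rho}{d\pi_{\exp(-\beta r)}}\Bigr)\Bigr\}\Bigr]\Bigr\} \le 1.$$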



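This leads to

Theorem 1.3.11. For any positive real constants $\beta < \lambda$, for any posterior
distributions $\nu : \Omega \to \mathcal{M}_+^1(M)$ and $\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$, with $\mathbb{P}\nu\rho$ probability at least
$1 - \epsilon$,

$$\begin{aligned}
\lambda\Phi_{\frac{\lambda}{N}}[R(m, \theta)] - \beta\Phi_{-\frac{\beta}{N}}[R(m, \theta)]
&\le (\lambda - \beta)r(m, \theta) + 2\log\Bigl[\frac{d\nu}{d\mu}(m)\Bigr] + \log\Bigl[\frac{d\rho}{d\pi_{\exp(-\beta r)}}(m, \theta)\Bigr] - 2\log(\epsilon) \\
&= \int_\beta^\lambda \pi_{\exp(-\alpha r)}(r)\,d\alpha + 2\log\Bigl[\frac{d\nu}{d\mu}(m)\Bigr] + \log\Bigl[\frac{d\rho}{d\pi_{\exp(-\lambda r)}}(m, \theta)\Bigr] - 2\log(\epsilon) \\
&= -2\log\Bigl\{\mu\Bigl[\exp\Bigl(-\frac{1}{2}\int_\beta^\lambda \pi_{\exp(-\alpha r)}(r)\,d\alpha\Bigr)\Bigr]\Bigr\} \\
&\qquad + 2\log\Bigl[\frac{d\nu}{d\mu_{\bigl(\frac{\pi[\exp(-\lambda r)]}{\pi[\exp(-\beta r)]}\bigr)^{1/2}}}(m)\Bigr] + \log\Bigl[\frac{d\rho}{d\pi_{\exp(-\lambda r)}}(m, \theta)\Bigr] - 2\log(\epsilon).
\end{aligned}$$

Another way to state the same inequality is to say that for any real constants $\alpha$ and
$\gamma$ such that $0 \le \gamma < \alpha < 1$, with $\mathbb{P}\nu\rho$ probability at least $1 - \epsilon$,

$$R(m, \theta) \le \frac{\alpha - \gamma}{2\alpha\gamma}\Bigl(\sqrt{1 + \frac{4\alpha\gamma}{(\alpha - \gamma)^2}\bigl\{1 - \exp[-(\alpha - \gamma)B(m, \theta)]\bigr\}} - 1\Bigr) \le B(m, \theta),$$

where

$$\begin{aligned}
B(m, \theta) &= -\frac{\log[(1 - \alpha)(1 + \gamma)]}{\alpha - \gamma}\,r(m, \theta) + \frac{2\log\Bigl[\dfrac{d\nu}{d\mu}(m)\Bigr] + \log\Bigl[\dfrac{d\rho}{d\pi_{(1 + \gamma)^{-Nr}}}(m, \theta)\Bigr] - 2\log(\epsilon)}{N(\alpha - \gamma)} \\
&= \frac{2\log\Bigl[\dfrac{d\nu}{d\mu_{\bigl(\frac{\pi[(1 - \alpha)^{Nr}]}{\pi[(1 + \gamma)^{-Nr}]}\bigr)^{1/2}}}(m)\Bigr] + \log\Bigl[\dfrac{d\rho}{d\pi_{(1 - \alpha)^{Nr}}}(m, \theta)\Bigr] - 2\log(\epsilon)}{N(\alpha - \gamma)} \\
&\qquad - \frac{2}{N(\alpha - \gamma)}\log\Bigl\{\mu\Bigl[\exp\Bigl(-\frac{1}{2}\int_{N\log(1 + \gamma)}^{-N\log(1 - \alpha)} \pi_{\exp(-\xi r)}(r)\,d\xi\Bigr)\Bigr]\Bigr\}.
\end{aligned}$$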
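Let us remark that in the case when $\nu = \mu_{\bigl(\frac{\pi[(1 - \alpha)^{Nr}]}{\pi[(1 + \gamma)^{-Nr}]}\bigr)^{1/2}}$ and $\rho = \pi_{(1 - \alpha)^{Nr}}$, we
get as desired a bound that is adaptively local in all the $\Theta_m$ (at least when $M$ is
countable and $\mu$ is atomic):

$$\begin{aligned}
B_2(\nu, \rho) &\le -\frac{2}{N(\alpha - \gamma)}\log\Bigl\{\mu\Bigl[\exp\Bigl\{\frac{N}{2}\log[(1 + \gamma)(1 - \alpha)]\,r^\star - \frac{1}{2}\log\Bigl(\frac{-\log(1 - \alpha)}{\log(1 + \gamma)}\Bigr)d_e\Bigr\}\Bigr]\Bigr\} - \frac{2\log(\epsilon)}{N(\alpha - \gamma)} \\
&\le \inf_{m \in M}\Bigl\{-\frac{\log[(1 - \alpha)(1 + \gamma)]}{\alpha - \gamma}\,r^\star(m) + \log\Bigl(\frac{-\log(1 - \alpha)}{\log(1 + \gamma)}\Bigr)\frac{d_e(m)}{N(\alpha - \gamma)} - \frac{2\log[\epsilon\,\mu(m)]}{N(\alpha - \gamma)}\Bigr\}.
\end{aligned}$$

The penalization by the empirical dimension $d_e(m)$ in each sub-model is as desired
linear in $d_e(m)$. Non random partially local bounds could be obtained in a way that
is easy to imagine. We leave this investigation to the reader.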
1.3.6. Two step localization



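We have seen that the bound optimal choice of the posterior distribution $\nu$ on the
index set in Theorem 1.3.10 (page 25) is such that

$$\frac{d\nu}{d\mu}(m) \sim \Bigl(\frac{\pi[\exp(-\lambda r(m, \cdot))]}{\pi[\exp(-\beta r(m, \cdot))]}\Bigr)^{1/2} = \exp\Bigl[-\frac{1}{2}\int_\beta^\lambda \pi_{\exp(-\alpha r)}(m, r)\,d\alpha\Bigr].$$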



This suggests replacing the prior distribution $\mu$ with $\bar\mu$ defined by its density

(1.14) fgfrn) = "fWj , 
1 7 7 /i[exp(-/i)] ' 

where /i(m) = -f / 7r exp (_ a <i> oi?) [$_j7 oi?(m, •)] da. 
Jp « 

The use of $\Phi_{-\frac{\eta}{N}} \circ R$ instead of $R$ is motivated by technical reasons which will appear
in subsequent computations. Indeed, we will need to bound

N " 

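in order to handle $\mathcal{K}(\nu, \bar\mu)$. In the spirit of equation (1.9, page 19), starting back
from Theorem 1.1.4 (page 4), applied in each sub-model $\Theta_m$ to the prior distribution
$\pi_{\exp(-\gamma\Phi_{-\frac{\eta}{N}} \circ R)}$ and integrated with respect to $\bar\mu$, we see that for any positive real
constants $\lambda$, $\gamma$ and $\eta$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution
$\nu : \Omega \to \mathcal{M}_+^1(M)$ on the index set and any conditional posterior distribution
$\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,

$$\nu\rho\bigl(\lambda\Phi_{\frac{\lambda}{N}} \circ R - \gamma\Phi_{-\frac{\eta}{N}} \circ R\bigr) \le \lambda\nu\rho(r) + \nu\mathcal{K}(\rho, \pi) + \mathcal{K}(\nu, \bar\mu) + \nu\bigl\{\log\pi\bigl[\exp(-\gamma\Phi_{-\frac{\eta}{N}} \circ R)\bigr]\bigr\} - \log(\epsilon). \tag{1.15}$$

Since $x \mapsto f(x) = \lambda\Phi_{\frac{\lambda}{N}}(x) - \gamma\Phi_{-\frac{\eta}{N}}(x)$ is a convex function, it is such that

$$f(x) \ge x f'(0) = xN\Bigl\{\bigl[1 - \exp(-\tfrac{\lambda}{N})\bigr] - \frac{\gamma}{\eta}\bigl[\exp(\tfrac{\eta}{N}) - 1\bigr]\Bigr\}.$$

Thus if we put

$$\gamma = \frac{\eta\bigl[1 - \exp(-\frac{\lambda}{N})\bigr]}{\exp(\frac{\eta}{N}) - 1}, \tag{1.16}$$

we obtain that $f(x) \ge 0$, $x \in \mathbb{R}$, and therefore that the left-hand side of equation
(1.15) is non-negative. We can moreover introduce the prior conditional distribution
$\bar\pi$ defined by

$$\frac{d\bar\pi}{d\pi}(m, \theta) = \frac{\exp\bigl[-\beta\Phi_{-\frac{\eta}{N}} \circ R(\theta)\bigr]}{\pi\bigl\{m, \exp\bigl[-\beta\Phi_{-\frac{\eta}{N}} \circ R\bigr]\bigr\}}.$$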
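
The effect of condition (1.16) can be observed numerically. The following sketch assumes $f(x) = \lambda\Phi_{\lambda/N}(x) - \gamma\Phi_{-\eta/N}(x)$ with $\Phi_a(p) = -a^{-1}\log\{1 - [1 - \exp(-a)]p\}$, and evaluates it on a grid of $[0, 1]$; the values of $N$, $\lambda$ and $\eta$ are arbitrary illustrative choices of ours.

```python
import math

n, lam, eta = 1000, 200.0, 150.0

# Condition (1.16): this value of gamma makes f'(0) vanish.
gamma = eta * (1.0 - math.exp(-lam / n)) / (math.exp(eta / n) - 1.0)

def f(x):
    """f(x) = lam * Phi_{lam/n}(x) - gamma * Phi_{-eta/n}(x), written with logarithms."""
    first = -n * math.log(1.0 - (1.0 - math.exp(-lam / n)) * x)
    second = (gamma * n / eta) * math.log(1.0 + (math.exp(eta / n) - 1.0) * x)
    return first - second

values = [f(i / 100.0) for i in range(101)]

# Inverting (1.16) gives back lam from gamma and eta.
lam_back = -n * math.log(1.0 - gamma / eta * (math.exp(eta / n) - 1.0))
print(min(values), lam_back)
```

On this grid $f$ vanishes at zero and is non-negative elsewhere, as the convexity argument predicts, and solving (1.16) for $\lambda$ returns the value we started from.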

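With $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distributions $\nu : \Omega \to \mathcal{M}_+^1(M)$
and $\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,

$$\begin{aligned}
\beta\nu\rho(r) + \nu[\mathcal{K}(\rho, \pi)] &= \nu\bigl\{\mathcal{K}[\rho, \pi_{\exp(-\beta r)}]\bigr\} - \nu\bigl\{\log\pi[\exp(-\beta r)]\bigr\} \\
&\le \nu\bigl\{\mathcal{K}[\rho, \pi_{\exp(-\beta r)}]\bigr\} + \beta\nu\bar\pi(r) + \nu[\mathcal{K}(\bar\pi, \pi)] \\
&\le \nu\bigl\{\mathcal{K}[\rho, \pi_{\exp(-\beta r)}]\bigr\} + \beta\nu\bar\pi\bigl(\Phi_{-\frac{\eta}{N}} \circ R\bigr) + \frac{\beta}{\eta}\bigl[\mathcal{K}(\nu, \bar\mu) - \log(\epsilon)\bigr] + \nu[\mathcal{K}(\bar\pi, \pi)] \\
&= \nu\bigl\{\mathcal{K}[\rho, \pi_{\exp(-\beta r)}]\bigr\} - \nu\bigl\{\log\pi\bigl[\exp(-\beta\Phi_{-\frac{\eta}{N}} \circ R)\bigr]\bigr\} + \frac{\beta}{\eta}\bigl[\mathcal{K}(\nu, \bar\mu) - \log(\epsilon)\bigr].
\end{aligned}$$

Thus, coming back to equation (1.15), we see that under condition (1.16), with
probability at least $1 - \epsilon$,

$$0 \le (\lambda - \beta)\nu\rho(r) + \nu\bigl\{\mathcal{K}[\rho, \pi_{\exp(-\beta r)}]\bigr\} - \nu\Bigl[\int_\beta^\gamma \pi_{\exp(-\alpha\Phi_{-\frac{\eta}{N}} \circ R)}\bigl(\Phi_{-\frac{\eta}{N}} \circ R\bigr)\,d\alpha\Bigr] + \Bigl(1 + \frac{\beta}{\eta}\Bigr)\Bigl[\mathcal{K}(\nu, \bar\mu) + \log\Bigl(\frac{2}{\epsilon}\Bigr)\Bigr].$$

Noticing moreover that

$$(\lambda - \beta)\nu\rho(r) + \nu\bigl\{\mathcal{K}[\rho, \pi_{\exp(-\beta r)}]\bigr\} = \nu\bigl\{\mathcal{K}[\rho, \pi_{\exp(-\lambda r)}]\bigr\} + \nu\Bigl[\int_\beta^\lambda \pi_{\exp(-\alpha r)}(r)\,d\alpha\Bigr]$$

and choosing $\rho = \pi_{\exp(-\lambda r)}$, we have proved

Theorem 1.3.12. For any positive real constants $\beta$, $\gamma$ and $\eta$, such that
$\gamma < \eta\bigl[\exp(\frac{\eta}{N}) - 1\bigr]^{-1}$, defining $\lambda$ by condition (1.16), so that
$\lambda = -N\log\bigl\{1 - \frac{\gamma}{\eta}\bigl[\exp(\frac{\eta}{N}) - 1\bigr]\bigr\}$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior
distribution $\nu : \Omega \to \mathcal{M}_+^1(M)$, any conditional posterior distribution $\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,

$$\nu\Bigl[\int_\beta^\gamma \pi_{\exp(-\alpha\Phi_{-\frac{\eta}{N}} \circ R)}\bigl(\Phi_{-\frac{\eta}{N}} \circ R\bigr)\,d\alpha\Bigr] \le \nu\Bigl[\int_\beta^\lambda \pi_{\exp(-\alpha r)}(r)\,d\alpha\Bigr] + \Bigl(1 + \frac{\beta}{\eta}\Bigr)\Bigl[\mathcal{K}(\nu, \bar\mu) + \log\Bigl(\frac{2}{\epsilon}\Bigr)\Bigr].$$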



Let us remark that this theorem does not require that $\beta < \gamma$, and thus provides
both an upper and a lower bound for the quantity of interest:

Corollary 1.3.13. For any positive real constants $\beta$, $\gamma$ and $\eta$ such that
$\max\{\beta, \gamma\} < \eta\bigl[\exp(\frac{\eta}{N}) - 1\bigr]^{-1}$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distributions
$\nu : \Omega \to \mathcal{M}_+^1(M)$ and $\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,



/7 
-jviog-ri 



log{l--g.[exp( *)-!]} 



exp( — ar 



}{r)da 



(l + ^)[3C(^7l) + log(f)] 



< v 



^expf— a4> 17 o_R) 
TV 
< v 



L 



-JVlog{l-^[exp(^)-l]} 



+ (l + |)[3C(i/,7l)+Iog(| 



We can then remember that



/ 7I"c X p(-a*_ „ oil) °R)da 
J0 77 



+ 0C(i/,At) — 3C(7*,a*), 



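to conclude that, putting

$$G_\eta(\alpha) = -N\log\Bigl\{1 - \frac{\alpha}{\eta}\bigl[\exp(\tfrac{\eta}{N}) - 1\bigr]\Bigr\} \ge \alpha, \qquad \alpha \in \mathbb{R}_+, \tag{1.17}$$

and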

nis > v def exp[-ft(m)] /-t , 

(1.18) — (to) = — r -, where ft(m) = g / 7r oxp( _ Qr) (to, r)cta, 

d/i M[exp(-^)J y Gi)(/3) 

the divergence of v with respect to the local prior ~p, is bounded by 

[i-^(i + |)]ac(i/,7i) 



G„( 7 ) 



7Texp(-Qr)(r) dQ! 



/■7 

/ 7!"exp(-ar)(^)rfa 



G„( 7 ) 



n cxp (- ar )(r)da 



+ X( V , (j,) - X(JI, (i) + ^(2 + £±Z) log(f) 
- 3C(i/, a*) 



+ 



iog|^ cxp^-^ y (r)da > 



+ C(2 + ^)log(|) 



/•G„(/3) /-G„( 7 K 

y +y j^ CX p(- ar )(r)da 



+ e(2+^)log(f). 



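We have proved

Theorem 1.3.14. For any positive constants $\beta$, $\gamma$ and $\eta$ such that
$\max\{\beta, \gamma\} < \eta\bigl[\exp(\frac{\eta}{N}) - 1\bigr]^{-1}$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any
posterior distribution $\nu : \Omega \to \mathcal{M}_+^1(M)$ and any conditional posterior distribution
$\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,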



- / /-G„(/3) /-G,( 7 )\ 

/ + / 7Texp(-ar)Mda 



< 



i-^(i + 0]" 1 {ac( I ,,^) 



e(2 + ^)log(|)} 
[G„( 7 ) -7 + G„C9) -/?]r* + log( 



Pi 



7} / 



log(| 



where the local prior $\bar\mu$ is defined by equation (1.14) and the local posterior
$\widehat\nu$ and the function $G_\eta$ are defined by equation (1.18, page 30).



We can then use this theorem to give a local version of Theorem 1.3.10 (page 25).
To get something pleasing to read, we can apply Theorem 1.3.14 with constants $\beta'$,



7' and 77 chosen so that 



2« 



the constants appearing in Theorem ll.3.101 This gives 



1, G r] (j3') — (3 and 7' = A, where f3 and A are 



Theorem 1.3.15. For any positive real constants $\beta < \lambda$ and $\eta$ such that
$\lambda < \eta\bigl[\exp(\frac{\eta}{N}) - 1\bigr]^{-1}$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution
$\nu : \Omega \to \mathcal{M}_+^1(M)$, for any conditional posterior distribution $\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,



vp[\$x(R)-0$_e(Rj\ <\§x[up{R)] -p$_ [vp(R)} <B 3 (v,p), 

L N 7\T N L J N 

rO v W 
lG- l (P) \ 



where B 3 {v 1 p) = v 



'■M>| -(3+ ) J A Tr cxp( - ar) {r)da] 



'{X{ P , TT, 



exp( — Ar) 



]}+ 4 



log( 



< 1/ 



[G,(A)-G-V)P + log(|^ 



»7 



/ 1 



+ ^3C(p,7r exp( _ Ar) ] } + (4 + ^r^) lo s(!)' 

and where the function $G_\eta$ is defined by equation (1.17, page 30).

A first remark: if we had the stamina to use Cauchy-Schwarz inequalities (or more
generally Hölder inequalities) on exponential moments instead of using weighted
union bounds on deviation inequalities, we could have replaced $\log(\frac{2}{\epsilon})$ with $-\log(\epsilon)$
in the above inequalities.

We see that we have achieved the desired kind of localization of Theorem 1.3.10
(page 25), since the new empirical entropy term

K^A* r t r A / u J 

CXP[-5 J Toxp(-c,r)('-)da] J 

cancels for a value of the posterior distribution on the index set $\nu$ which is of the
same form as the one minimizing the bound $B_1(\nu, \rho)$ of Theorem 1.3.10 (with a
decreased constant, as could be expected). In a typical parametric setting, we will
have

n e x P (-ar){r)da ~ (A - /3)r*(m) + log d e (m), 

and therefore, if we choose for v the Dirac mass at 

m G argmm meM r (m) + a e (m), 
and $\rho(m, \cdot) = \pi_{\exp(-\lambda r)}(m, \cdot)$, we will get, in the case when the index set $M$ is
countable,

L Gift} J 



B 3 (i/, p) < max I [G v {\) - G^(j3)] , (A - 0)—^ 



n 



^{ E 



, mEM 



exp 



(fn) + l ^^Ld e (m) 



{(A - /?) [r*(m) - r*(m)] + log(|) [d e (m) - d e {m)\ } 



log( 



This shows that the impact on the bound of the addition of supplementary models
depends on their penalized minimum empirical risk $r^\star(m) + \frac{\log(\lambda/\beta)}{\lambda - \beta}\,d_e(m)$. More
precisely the adaptive and local complexity factor



M E 



, meM 



fj,(m) 



■ exp 



x {(A-/3)[r*(m)-r*(m)] + log(£) [d e (m) - d e (m)] } 
replaces in this bound the non local factor 



M) = - log[M™)] = log 



E 



which appears when applying Theorem 1.3.10 (page 25) to the Dirac mass $\nu = \delta_{\widehat m}$.
Thus in the local bound, the influence of models decreases exponentially fast when
their penalized empirical risk increases.

One can deduce a result about the deviations with respect to the posterior $\nu\rho$
from Theorem 1.3.15 (page 31) without much supplementary work: it is enough for
that purpose to remark that with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior
distribution $\nu : \Omega \to \mathcal{M}_+^1(M)$,



log{7r exp( _ Ar) exp{A$^(i?) -/3*_^(JJ)} } 



G V (X) 



7T, 



G^ (0) 



exp(-ar) 



(r)da 



(3 + Sl>M 

^ ' exp 



\ (r)da 



g- 1 (p)+* 



log(f) < 0, 



this inequality being obtained by taking a supremum in $\rho$ in Theorem 1.3.15 (page
31). One can then take a supremum in $\nu$, to get, still with $\mathbb{P}$ probability at least
$1 - \epsilon$,



log< H 



cxp 



[-( 3 + G "r, <,3) ) /**«p(-ar)W(fa 

{7Texp(-Ar) CX P { A$ £ (i?) - /3$_ £ (i?) } } 



(s+S^l)- 1 



exp -(3+- 1 - — / 7r C xp(-ar)(?-)rfa 

V V 7 • / G" 1 (/3) , 



< T log(^). 



3 + 



J) 



Using the fact that x i— > x Q is concave when a — (3 H ^ — ) < 1, we get for 

any posterior conditional distribution p : ft x M — > (9), 



r ) (r)da 



cxp 



/ 3+ G^)\ 'L (Ji) .^_ i(Ji) . [ GVW 7: cxp( - ar) (r)da 



+ log 



dp 



dn, 



exp( — Xr) 



-(fh,6) 



f 4 

< exp - 



V 3 + 



I°8(f) I 



G^(/3) 



We can thus state 



Theorem 1.3.16. For any $\epsilon \in\, )0, 1($, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any
posterior distribution $\nu : \Omega \to \mathcal{M}_+^1(M)$ and conditional posterior distribution
$\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$, for any $\xi \in\, )0, 1($, with $\nu\rho$ probability at least $1 - \xi$,



/■G„(A) 

A$ A (i?) -)3<f>_e_(R) < / 7r exp( _ Qr .)(r)da 



(3 + ^)lo S 



-(to) 



cxp 



-(3 + ^)"7>. 



\ (r)da 



+ log 



dp 

cxp( — Xr) 



-(m,6) 



+ (4 + ^^)log(!)-(3 + ^)log(0. 



Note that the given bound consequently holds with $\mathbb{P}\nu\rho$ probability at least
$(1 - \epsilon)(1 - \xi) \ge 1 - \epsilon - \xi$.
1.4. Relative bounds



The behaviour of the minimum of the empirical process $\theta \mapsto r(\theta)$ is known to
depend on the covariances between pairs $[r(\theta), r(\theta')]$, $\theta, \theta' \in \Theta$. In this respect,
our previous study, based on the analysis of the variance of $r(\theta)$ (or technically
on some exponential moment playing quite the same role), loses some accuracy in
some circumstances (namely when $\inf_\Theta R$ is not close enough to zero).

In this section, instead of bounding the expected risk $\rho(R)$ of any posterior
distribution, we are going to upper bound the difference $\rho(R) - \inf_\Theta R$, and more
generally $\rho(R) - R(\widetilde\theta)$, where $\widetilde\theta \in \Theta$ is some fixed parameter value.

In the next section we will analyse $\rho(R) - \pi_{\exp(-\beta R)}(R)$, allowing us to compare
the expected error rate of a posterior distribution $\rho$ with the error rate of a Gibbs
prior distribution. We will also analyse $\rho_1(R) - \rho_2(R)$, where $\rho_1$ and $\rho_2$ are two
arbitrary posterior distributions, using comparison with a Gibbs prior distribution
as a tool, and in particular as a tool to establish the required Kullback divergence
bounds.

Relative bounds do not provide the same kind of results as direct bounds on
the error rate: it is not possible to estimate $\rho(R)$ with an order of precision higher
than $(\rho(R)/N)^{1/2}$, so that relative bounds cannot of course achieve that, but they
provide a way to reach a faster rate for $\rho(R) - \inf_\Theta R$, that is for the relative
performance of the estimator within a restricted model.

The study of PAC-Bayesian relative bounds was initiated in the second and third
parts of J.-Y. Audibert's dissertation (Audibert, 2004b).

In this section and the next, we will suggest a series of possible uses of relative 
bounds. As usual, we will start with the simplest inequalities and proceed towards 
more sophisticated techniques with better theoretical properties, but at the same 
time less precise constants, so that which one is the more fitted will depend on the 
size of the training sample. 

The first thing we will do is to compute for any posterior distribution
$\rho : \Omega \to \mathcal{M}_+^1(\Theta)$ a relative performance bound bearing on $\rho(R) - \inf_\Theta R$. We will also
compare the classification model indexed by $\Theta$ with a sub-model indexed by one of
its measurable subsets $\Theta_1 \subset \Theta$. For this purpose we will form the difference
$\rho(R) - R(\widetilde\theta)$, where $\widetilde\theta \in \Theta_1$ is some possibly unobservable value of the parameter
in the sub-model defined by $\Theta_1$, typically chosen in $\arg\min_{\Theta_1} R$. If this is so
and $\rho(R) - R(\widetilde\theta) = \rho(R) - \inf_{\Theta_1} R$, a negative upper bound indicates that it is
definitely worth using a randomized estimator $\rho$ supported by the larger parameter
set $\Theta$ instead of using only the classification model defined by the smaller set $\Theta_1$.



1.4.1. Basic inequalities

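Relative bounds in this section are based on the control of $r(\theta) - r(\widetilde\theta)$, where $\theta, \widetilde\theta \in \Theta$.
These differences are related to the random variables

$$\psi_i(\theta, \widetilde\theta) = \sigma_i(\theta) - \sigma_i(\widetilde\theta) = \mathbb{1}\bigl[f_\theta(X_i) \ne Y_i\bigr] - \mathbb{1}\bigl[f_{\widetilde\theta}(X_i) \ne Y_i\bigr].$$

Some supplementary technical difficulties, as compared to the previous sections,
come from the fact that $\psi_i(\theta, \widetilde\theta)$ takes three values, whereas $\sigma_i(\theta)$ takes only two.
Let

$$r'(\theta, \widetilde\theta) = r(\theta) - r(\widetilde\theta) = \frac{1}{N}\sum_{i=1}^{N} \psi_i(\theta, \widetilde\theta), \qquad \theta, \widetilde\theta \in \Theta, \tag{1.19}$$

and $R'(\theta, \widetilde\theta) = R(\theta) - R(\widetilde\theta) = \mathbb{P}[r'(\theta, \widetilde\theta)]$. We have as usual from independence that

$$\log\Bigl\{\mathbb{P}\bigl[\exp[-\lambda r'(\theta, \widetilde\theta)]\bigr]\Bigr\} = \sum_{i=1}^{N}\log\Bigl\{\mathbb{P}\bigl[\exp[-\tfrac{\lambda}{N}\psi_i(\theta, \widetilde\theta)]\bigr]\Bigr\} \le N\log\Bigl\{\frac{1}{N}\sum_{i=1}^{N}\mathbb{P}\bigl\{\exp[-\tfrac{\lambda}{N}\psi_i(\theta, \widetilde\theta)]\bigr\}\Bigr\}.$$

Let $C_i$ be the distribution of $\psi_i(\theta, \widetilde\theta)$ under $\mathbb{P}$ and let $\bar{C} = \frac{1}{N}\sum_{i=1}^{N} C_i \in \mathcal{M}_+^1(\{-1, 0, 1\})$.
With this notation

$$\log\Bigl\{\mathbb{P}\bigl[\exp[-\lambda r'(\theta, \widetilde\theta)]\bigr]\Bigr\} \le N\log\Bigl\{\int_{\psi \in \{-1, 0, 1\}}\exp\bigl(-\tfrac{\lambda}{N}\psi\bigr)\bar{C}(d\psi)\Bigr\}. \tag{1.20}$$

The right-hand side of this inequality is a function of $\bar{C}$. On the other hand, $\bar{C}$
being a probability measure on a three point set, is defined by two parameters, that
we may take equal to $\int\psi\,\bar{C}(d\psi)$ and $\int\psi^2\,\bar{C}(d\psi)$. To this purpose, let us introduce

$$M'(\theta, \widetilde\theta) = \int\psi^2\,\bar{C}(d\psi) = \bar{C}(+1) + \bar{C}(-1) = \frac{1}{N}\sum_{i=1}^{N}\mathbb{P}\bigl[\psi_i^2(\theta, \widetilde\theta)\bigr], \qquad \theta, \widetilde\theta \in \Theta.$$

It is a pseudo distance (meaning that it is symmetric and satisfies the triangle
inequality), since it can also be written as

$$M'(\theta, \widetilde\theta) = \frac{1}{N}\sum_{i=1}^{N}\mathbb{P}\Bigl\{\Bigl|\mathbb{1}\bigl[f_\theta(X_i) \ne Y_i\bigr] - \mathbb{1}\bigl[f_{\widetilde\theta}(X_i) \ne Y_i\bigr]\Bigr|\Bigr\}, \qquad \theta, \widetilde\theta \in \Theta.$$

It is readily seen that

$$N\log\Bigl\{\int\exp\bigl(-\tfrac{\lambda}{N}\psi\bigr)\bar{C}(d\psi)\Bigr\} = -\lambda\Psi_{\frac{\lambda}{N}}\bigl[R'(\theta, \widetilde\theta), M'(\theta, \widetilde\theta)\bigr],$$

where

$$\begin{aligned}
\Psi_a(p, m) &= -a^{-1}\log\Bigl[(1 - m) + \frac{m + p}{2}\exp(-a) + \frac{m - p}{2}\exp(a)\Bigr] \\
&= -a^{-1}\log\Bigl\{1 - \sinh(a)\bigl[p - m\tanh(\tfrac{a}{2})\bigr]\Bigr\}.
\end{aligned} \tag{1.21}$$

Thus plugging this equality into inequality (1.20, page 35) we get

Theorem 1.4.1. For any real parameter $\lambda$,

$$\log\Bigl\{\mathbb{P}\exp\bigl[-\lambda r'(\theta, \widetilde\theta)\bigr]\Bigr\} \le -\lambda\Psi_{\frac{\lambda}{N}}\bigl[R'(\theta, \widetilde\theta), M'(\theta, \widetilde\theta)\bigr], \qquad \theta, \widetilde\theta \in \Theta,$$

where $r'$ is defined by equation (1.19) and $\Psi$ and $M'$ are defined just above.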
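
The two expressions of $\Psi_a(p, m)$ in (1.21) can be compared numerically. The sketch below assumes the closed form $-a^{-1}\log\{1 - \sinh(a)[p - m\tanh(a/2)]\}$ and checks it against the log-Laplace transform of a three-point distribution on $\{-1, 0, 1\}$; the weights 0.1, 0.6, 0.3 and the value $a = 0.7$ are arbitrary choices of ours.

```python
import math

def psi_closed_form(a, p, m):
    """Psi_a(p, m) in the hyperbolic form of equation (1.21)."""
    return -math.log(1.0 - math.sinh(a) * (p - m * math.tanh(a / 2.0))) / a

def psi_three_point(a, c_minus, c_zero, c_plus):
    """-(1/a) log of the Laplace transform of a law on {-1, 0, +1}."""
    laplace = c_minus * math.exp(a) + c_zero + c_plus * math.exp(-a)
    return -math.log(laplace) / a

c_minus, c_zero, c_plus = 0.1, 0.6, 0.3   # arbitrary three-point law (assumption)
p = c_plus - c_minus                      # first moment, plays the role of R'
m = c_plus + c_minus                      # second moment, plays the role of M'
a = 0.7

value_closed = psi_closed_form(a, p, m)
value_direct = psi_three_point(a, c_minus, c_zero, c_plus)
print(value_closed, value_direct)
```

Both evaluations coincide, since $(1 - m) + m\cosh(a) - p\sinh(a) = 1 - \sinh(a)[p - m\tanh(a/2)]$.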
To make a link with previous work of Mammen and Tsybakov (see e.g.
Mammen et al. (1999) and Tsybakov (2004)) we may consider the pseudo-distance
$D$ on $\Theta$ defined by equation (1.3, page 7). This distance only depends on the
distribution of the patterns. It is often used to formulate margin assumptions, in the
sense of Mammen and Tsybakov. Here we are going to work rather with $M'$: as it
is dominated by $D$ in the sense that $M'(\theta, \widetilde\theta) \le D(\theta, \widetilde\theta)$, $\theta, \widetilde\theta \in \Theta$, with equality in
the important case of binary classification, hypotheses formulated on $D$ induce
hypotheses on $M'$, and working with $M'$ may only sharpen the results when compared
to working with $D$.

Using the same reasoning as in the previous section, we deduce

Theorem 1.4.2. For any real parameter $\lambda$, any $\widetilde\theta \in \Theta$, any prior distribution
$\pi \in \mathcal{M}_+^1(\Theta)$,

$$\mathbb{P}\Bigl\{\exp\Bigl[\sup_{\rho \in \mathcal{M}_+^1(\Theta)} \lambda\Bigl[\rho\bigl\{\Psi_{\frac{\lambda}{N}}\bigl[R'(\cdot, \widetilde\theta), M'(\cdot, \widetilde\theta)\bigr]\bigr\} - \rho\bigl[r'(\cdot, \widetilde\theta)\bigr]\Bigr] - \mathcal{K}(\rho, \pi)\Bigr]\Bigr\} \le 1.$$

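We are now going to derive some other type of relative exponential inequality.
In Theorem 1.4.2 we obtained an inequality comparing one observed quantity
$\rho[r'(\cdot, \widetilde\theta)]$ with two unobserved ones, $\rho[R'(\cdot, \widetilde\theta)]$ and $\rho[M'(\cdot, \widetilde\theta)]$ (indeed, because
of the convexity of the function $\lambda\Psi_{\frac{\lambda}{N}}$,

$$\lambda\rho\bigl\{\Psi_{\frac{\lambda}{N}}\bigl[R'(\cdot, \widetilde\theta), M'(\cdot, \widetilde\theta)\bigr]\bigr\} \ge \lambda\Psi_{\frac{\lambda}{N}}\bigl\{\rho[R'(\cdot, \widetilde\theta)], \rho[M'(\cdot, \widetilde\theta)]\bigr\}).$$

This may be inconvenient when looking for an empirical bound for $\rho[R'(\cdot, \widetilde\theta)]$,
and we are going now to seek an inequality comparing $\rho[R'(\cdot, \widetilde\theta)]$ with empirical
quantities only.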

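This is possible by considering the log-Laplace transform of some modified
random variable $\chi_i(\theta, \widetilde\theta)$. We may consider more precisely the change of variable defined
by the equation

$$\exp\Bigl(-\frac{\lambda}{N}\chi_i\Bigr) = 1 - \frac{\lambda}{N}\psi_i,$$

which is possible when $\frac{\lambda}{N} \in\, )-1, 1($ and leads to define

$$\chi_i = -\frac{N}{\lambda}\log\Bigl(1 - \frac{\lambda}{N}\psi_i\Bigr).$$

We may then work on the log-Laplace transform

$$\log\Bigl\{\mathbb{P}\Bigl[\exp\Bigl(-\frac{\lambda}{N}\sum_{i=1}^{N}\chi_i(\theta, \widetilde\theta)\Bigr)\Bigr]\Bigr\} = \log\Bigl\{\mathbb{P}\Bigl[\prod_{i=1}^{N}\Bigl(1 - \frac{\lambda}{N}\psi_i(\theta, \widetilde\theta)\Bigr)\Bigr]\Bigr\} = \log\Bigl\{\mathbb{P}\Bigl[\exp\Bigl(\sum_{i=1}^{N}\log\Bigl[1 - \frac{\lambda}{N}\psi_i(\theta, \widetilde\theta)\Bigr]\Bigr)\Bigr]\Bigr\}.$$

We may now follow the same route as previously, writing

$$\log\Bigl\{\mathbb{P}\Bigl[\exp\Bigl(\sum_{i=1}^{N}\log\Bigl[1 - \frac{\lambda}{N}\psi_i(\theta, \widetilde\theta)\Bigr]\Bigr)\Bigr]\Bigr\} = \sum_{i=1}^{N}\log\Bigl[1 - \frac{\lambda}{N}\mathbb{P}\bigl[\psi_i(\theta, \widetilde\theta)\bigr]\Bigr] \le N\log\Bigl[1 - \frac{\lambda}{N}R'(\theta, \widetilde\theta)\Bigr].$$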
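Let us also introduce the random pseudo distance

$$m'(\theta, \widetilde\theta) = \frac{1}{N}\sum_{i=1}^{N}\psi_i(\theta, \widetilde\theta)^2 = \frac{1}{N}\sum_{i=1}^{N}\Bigl|\mathbb{1}\bigl[f_\theta(X_i) \ne Y_i\bigr] - \mathbb{1}\bigl[f_{\widetilde\theta}(X_i) \ne Y_i\bigr]\Bigr|, \qquad \theta, \widetilde\theta \in \Theta. \tag{1.22}$$

This is the empirical counterpart of $M'$, implying that $\mathbb{P}(m') = M'$. Let us notice
that

$$\begin{aligned}
\sum_{i=1}^{N}\log\Bigl[1 - \frac{\lambda}{N}\psi_i(\theta, \widetilde\theta)\Bigr]
&= N\,\frac{\log(1 - \frac{\lambda}{N}) - \log(1 + \frac{\lambda}{N})}{2}\,r'(\theta, \widetilde\theta) + N\,\frac{\log(1 - \frac{\lambda}{N}) + \log(1 + \frac{\lambda}{N})}{2}\,m'(\theta, \widetilde\theta) \\
&= \frac{N}{2}\log\Bigl(\frac{1 - \frac{\lambda}{N}}{1 + \frac{\lambda}{N}}\Bigr)r'(\theta, \widetilde\theta) + \frac{N}{2}\log\Bigl(1 - \frac{\lambda^2}{N^2}\Bigr)m'(\theta, \widetilde\theta).
\end{aligned}$$

Let us put $\gamma = \frac{N}{2}\log\Bigl(\frac{1 + \frac{\lambda}{N}}{1 - \frac{\lambda}{N}}\Bigr)$, so that

$$\lambda = N\tanh\bigl(\tfrac{\gamma}{N}\bigr) \quad\text{and}\quad \frac{N}{2}\log\Bigl(1 - \frac{\lambda^2}{N^2}\Bigr) = -N\log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr].$$

With this notation, we can conveniently write the previous inequality as

$$\mathbb{P}\Bigl\{\exp\Bigl[-N\log\bigl[1 - \tanh\bigl(\tfrac{\gamma}{N}\bigr)R'(\theta, \widetilde\theta)\bigr] - \gamma r'(\theta, \widetilde\theta) - N\log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]m'(\theta, \widetilde\theta)\Bigr]\Bigr\} \le 1.$$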
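
The change of parameter from $\lambda$ to $\gamma$ relies on elementary hyperbolic identities, which the following sketch checks numerically for each of the three possible values of $\psi_i$. It assumes $\gamma = \frac{N}{2}\log\frac{1 + \lambda/N}{1 - \lambda/N}$; the values $N = 1000$ and $\lambda = 300$ are arbitrary choices of ours.

```python
import math

n, lam = 1000, 300.0
a = lam / n                                   # must lie in (-1, 1)
gamma = 0.5 * n * math.log((1.0 + a) / (1.0 - a))

# lam = N tanh(gamma / N) and (N/2) log(1 - lam^2/N^2) = -N log cosh(gamma / N).
lam_back = n * math.tanh(gamma / n)
half_log = 0.5 * n * math.log(1.0 - a * a)
minus_log_cosh = -n * math.log(math.cosh(gamma / n))

# Pointwise identity: log(1 - a psi) = -(gamma/N) psi - log cosh(gamma/N) psi^2.
gaps = []
for psi in (-1, 0, 1):
    lhs = math.log(1.0 - a * psi)
    rhs = -(gamma / n) * psi - math.log(math.cosh(gamma / n)) * psi ** 2
    gaps.append(abs(lhs - rhs))
print(lam_back, half_log, minus_log_cosh, max(gaps))
```

Summing the pointwise identity over $i$ gives the decomposition in terms of $r'$ and $m'$ used in the inequality above.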
Integrating with respect to a prior probability measure ir £ Mi_(0), we obtain 

Theorem 1.4.3. For any real parameter 7, for any 9 £ 0, for any prior probability 
distribution ir £ Mi_(0), 



exp 



sup 



-Np{log[l-ta,nh(%)R'(;9)]} 

1P [r'{;6)] -JVlog[co8h(^)]p[m'(.,e)] -DC(p,7r) 



< 1. 
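The change of variable leading to Theorem 1.4.3 rests on an exact algebraic identity for variables $\psi_i\in\{-1,0,+1\}$, which may be checked numerically. The following is only a minimal sketch in Python; the function name `check_identity` and the simulated values of $\psi_i$ are illustrative assumptions, not part of the text.

```python
import math
import random

def check_identity(psi, lam):
    """Compare sum_i log(1 - (lam/N) psi_i) with
    -gamma * r' - N * log(cosh(gamma/N)) * m', for psi_i in {-1, 0, 1}."""
    n = len(psi)
    a = lam / n                                      # requires |a| < 1
    gamma = 0.5 * n * math.log((1 + a) / (1 - a))    # so that lam = N tanh(gamma/N)
    r_prime = sum(psi) / n                           # empirical r'
    m_prime = sum(p * p for p in psi) / n            # empirical pseudo distance m'
    lhs = sum(math.log(1 - a * p) for p in psi)
    rhs = -gamma * r_prime - n * math.log(math.cosh(gamma / n)) * m_prime
    return lhs, rhs, gamma

random.seed(0)
psi = [random.choice([-1, 0, 1]) for _ in range(50)]  # hypothetical sample
lhs, rhs, gamma = check_identity(psi, lam=10.0)
print(lhs, rhs)
```

Both printed values should agree up to rounding errors, whatever the sample, since the identity holds term by term.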



1.4.2. Non random bounds



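Let us first deduce a non-random bound from Theorem 1.4.2 (page 36). This theorem
can be conveniently taken advantage of by throwing the non-linearity into a
localized prior, considering the prior probability measure $\mu$ defined by its density

\[
\frac{d\mu}{d\pi}(\theta) =
\frac{\exp\bigl\{-\lambda\Psi_{\lambda/N}[R'(\theta,\tilde\theta),M'(\theta,\tilde\theta)] + \beta R'(\theta,\tilde\theta)\bigr\}}
{\pi\bigl\{\exp\bigl\{-\lambda\Psi_{\lambda/N}[R'(\cdot,\tilde\theta),M'(\cdot,\tilde\theta)] + \beta R'(\cdot,\tilde\theta)\bigr\}\bigr\}}.
\]

Indeed, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

\[
\mathcal{K}(\rho,\mu) = \mathcal{K}(\rho,\pi)
+ \lambda\rho\bigl\{\Psi_{\lambda/N}[R'(\cdot,\tilde\theta),M'(\cdot,\tilde\theta)]\bigr\}
- \beta\rho[R'(\cdot,\tilde\theta)]
+ \log\bigl\{\pi\bigl[\exp\bigl\{-\lambda\Psi_{\lambda/N}[R'(\cdot,\tilde\theta),M'(\cdot,\tilde\theta)] + \beta R'(\cdot,\tilde\theta)\bigr\}\bigr]\bigr\}.
\]

Plugging this into Theorem 1.4.2 (page 36) and using the convexity of the exponential
function, we see that for any posterior probability distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

\[
\beta\,\mathbb{P}\bigl\{\rho[R'(\cdot,\tilde\theta)]\bigr\}
\leq \lambda\,\mathbb{P}\bigl\{\rho[r'(\cdot,\tilde\theta)]\bigr\} + \mathbb{P}\bigl[\mathcal{K}(\rho,\pi)\bigr]
+ \log\bigl\{\pi\bigl[\exp\bigl\{-\lambda\Psi_{\lambda/N}[R'(\cdot,\tilde\theta),M'(\cdot,\tilde\theta)] + \beta R'(\cdot,\tilde\theta)\bigr\}\bigr]\bigr\}.
\]

We can then recall that

\[
\lambda\rho[r'(\cdot,\tilde\theta)] + \mathcal{K}(\rho,\pi)
= \mathcal{K}\bigl[\rho,\pi_{\exp(-\lambda r)}\bigr] - \log\bigl\{\pi\bigl[\exp[-\lambda r'(\cdot,\tilde\theta)]\bigr]\bigr\},
\]

and notice moreover that

\[
-\mathbb{P}\Bigl\{\log\bigl\{\pi\bigl[\exp[-\lambda r'(\cdot,\tilde\theta)]\bigr]\bigr\}\Bigr\}
\leq -\log\bigl\{\pi\bigl[\exp[-\lambda R'(\cdot,\tilde\theta)]\bigr]\bigr\},
\]

since $R' = \mathbb{P}(r')$ and $h \mapsto \log\{\pi[\exp(h)]\}$ is a convex functional. Putting these two
remarks together, we obtain

Theorem 1.4.4. For any real positive parameter $\lambda$, for any prior distribution
$\pi\in\mathcal{M}_+^1(\Theta)$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

\[
\mathbb{P}\bigl\{\rho[R'(\cdot,\tilde\theta)]\bigr\}
\leq \frac{1}{\beta}\mathbb{P}\bigl[\mathcal{K}(\rho,\pi_{\exp(-\lambda r)})\bigr]
+ \frac{1}{\beta}\log\bigl\{\pi\bigl[\exp\bigl\{-\lambda\Psi_{\lambda/N}[R'(\cdot,\tilde\theta),M'(\cdot,\tilde\theta)] + \beta R'(\cdot,\tilde\theta)\bigr\}\bigr]\bigr\}
- \frac{1}{\beta}\log\bigl\{\pi\bigl[\exp[-\lambda R'(\cdot,\tilde\theta)]\bigr]\bigr\}
\]
\[
\leq \frac{1}{\beta}\mathbb{P}\bigl[\mathcal{K}(\rho,\pi_{\exp(-\lambda r)})\bigr]
+ \frac{1}{\beta}\log\Bigl\{\pi\Bigl[\exp\Bigl\{-\bigl[N\sinh\bigl(\tfrac{\lambda}{N}\bigr)-\beta\bigr]R'(\cdot,\tilde\theta)
+ 2N\sinh\bigl(\tfrac{\lambda}{2N}\bigr)^2 M'(\cdot,\tilde\theta)\Bigr\}\Bigr]\Bigr\}
- \frac{1}{\beta}\log\bigl\{\pi\bigl[\exp[-\lambda R'(\cdot,\tilde\theta)]\bigr]\bigr\}.
\]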

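It may be interesting to derive some more suggestive (but slightly weaker) bound
in the important case when $\Theta_1 = \Theta$ and $R(\tilde\theta) = \inf_\Theta R$. In this case, it is convenient
to introduce the expected margin function

\[
\varphi(x) = \sup_{\theta\in\Theta} M'(\theta,\tilde\theta) - xR'(\theta,\tilde\theta), \qquad x\in\mathbb{R}_+. \tag{1.23}
\]

We see that $\varphi$ is convex and non-negative on $\mathbb{R}_+$. Using the bound
$M'(\theta,\tilde\theta) \leq xR'(\theta,\tilde\theta) + \varphi(x)$, we obtain

\[
\mathbb{P}\bigl\{\rho[R'(\cdot,\tilde\theta)]\bigr\}
\leq \frac{1}{\beta}\mathbb{P}\bigl[\mathcal{K}(\rho,\pi_{\exp(-\lambda r)})\bigr]
+ \frac{1}{\beta}\log\Bigl\{\pi\Bigl[\exp\Bigl\{-\bigl\{N\sinh\bigl(\tfrac{\lambda}{N}\bigr)\bigl[1-x\tanh\bigl(\tfrac{\lambda}{2N}\bigr)\bigr]-\beta\bigr\}R'(\cdot,\tilde\theta)\Bigr\}\Bigr]\Bigr\}
\]
\[
+ \frac{N\sinh(\frac{\lambda}{N})\tanh(\frac{\lambda}{2N})}{\beta}\varphi(x)
- \frac{1}{\beta}\log\bigl\{\pi\bigl[\exp[-\lambda R'(\cdot,\tilde\theta)]\bigr]\bigr\}.
\]

Let us make the change of variable $\gamma = N\sinh(\frac{\lambda}{N})\bigl[1-x\tanh(\frac{\lambda}{2N})\bigr] - \beta$ to obtain

Corollary 1.4.5. For any real positive parameters $x$, $\gamma$ and $\lambda$ such that
$x \leq \tanh(\frac{\lambda}{2N})^{-1}$ and $0 \leq \gamma < N\sinh(\frac{\lambda}{N})\bigl[1-x\tanh(\frac{\lambda}{2N})\bigr]$,

\[
\mathbb{P}[\rho(R)] - \inf_\Theta R \leq
\Bigl\{N\sinh\bigl(\tfrac{\lambda}{N}\bigr)\bigl[1-x\tanh\bigl(\tfrac{\lambda}{2N}\bigr)\bigr]-\gamma\Bigr\}^{-1}
\]
\[
\times\Bigl\{\int_\gamma^\lambda \bigl[\pi_{\exp(-\alpha R)}(R) - \inf_\Theta R\bigr]\,d\alpha
+ N\sinh\bigl(\tfrac{\lambda}{N}\bigr)\tanh\bigl(\tfrac{\lambda}{2N}\bigr)\varphi(x)
+ \mathbb{P}\bigl[\mathcal{K}(\rho,\pi_{\exp(-\lambda r)})\bigr]\Bigr\}.
\]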

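Let us remark that these results, although well suited to study Mammen and
Tsybakov's margin assumptions, hold in the general case: introducing the convex
expected margin function $\varphi$ is a substitute for making hypotheses about the relations
between $R$ and $D$.

Using the fact that $R'(\theta,\tilde\theta) \geq 0$, $\theta\in\Theta$, and that $\varphi(x) \geq 0$, $x\in\mathbb{R}_+$, we can
weaken and simplify the preceding corollary even more to get

Corollary 1.4.6. For any real parameters $\beta$, $\lambda$ and $x$ such that $x \geq 0$ and
$0 \leq \beta < \lambda - x\frac{\lambda^2}{2N}$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

\[
\mathbb{P}[\rho(R)] \leq \inf_\Theta R
+ \Bigl[\lambda - x\frac{\lambda^2}{2N} - \beta\Bigr]^{-1}
\Bigl\{\int_\beta^\lambda\bigl[\pi_{\exp(-\alpha R)}(R) - \inf_\Theta R\bigr]\,d\alpha
+ \mathbb{P}\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\lambda r)}]\bigr\}
+ \varphi(x)\frac{\lambda^2}{2N}\Bigr\}.
\]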



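Let us apply this bound under the margin assumption first considered by Mammen
and Tsybakov (Mammen et al., 1999; Tsybakov, 2004), which says that for
some real positive constant $c$ and some real exponent $\kappa \geq 1$,

\[
R'(\theta,\tilde\theta) \geq c\,D(\theta,\tilde\theta)^{\kappa}, \qquad \theta\in\Theta. \tag{1.24}
\]

In the case when $\kappa = 1$, then $\varphi(c^{-1}) = 0$, proving that

\[
\mathbb{P}\bigl\{\pi_{\exp(-\lambda r)}[R'(\cdot,\tilde\theta)]\bigr\}
\leq \frac{\int_\beta^\lambda \pi_{\exp(-\gamma R)}[R'(\cdot,\tilde\theta)]\,d\gamma}
{N\sinh(\frac{\lambda}{N})\bigl[1-c^{-1}\tanh(\frac{\lambda}{2N})\bigr]-\beta}
\leq \frac{\int_\beta^\lambda \pi_{\exp(-\gamma R)}[R'(\cdot,\tilde\theta)]\,d\gamma}
{\lambda - \frac{\lambda^2}{2cN} - \beta}.
\]

Taking for example $\lambda = \frac{cN}{2}$, $\beta = \frac{\lambda}{2} = \frac{cN}{4}$, we obtain

\[
\mathbb{P}\bigl[\pi_{\exp(-2^{-1}cNr)}(R)\bigr]
\leq \inf R + \frac{8}{cN}\int_{\frac{cN}{4}}^{\frac{cN}{2}} \pi_{\exp(-\gamma R)}[R'(\cdot,\tilde\theta)]\,d\gamma
\leq \inf R + 2\,\pi_{\exp(-\frac{cN}{4}R)}[R'(\cdot,\tilde\theta)].
\]

If moreover the behaviour of the prior distribution $\pi$ is parametric, meaning that
$\pi_{\exp(-\beta R)}[R'(\cdot,\tilde\theta)] \leq \frac{d}{\beta}$, for some positive real constant $d$ linked with the dimension
of the classification model, then

\[
\mathbb{P}\bigl[\pi_{\exp(-2^{-1}cNr)}(R)\bigr] \leq \inf R + \frac{8\log(2)\,d}{cN}.
\]

In the case when $\kappa > 1$,

\[
\varphi(x) \leq (\kappa-1)\kappa^{-\frac{\kappa}{\kappa-1}}(cx)^{-\frac{1}{\kappa-1}}
= (1-\kappa^{-1})(\kappa cx)^{-\frac{1}{\kappa-1}},
\]

thus

\[
\mathbb{P}\bigl\{\pi_{\exp(-\lambda r)}[R'(\cdot,\tilde\theta)]\bigr\}
\leq \frac{\int_\beta^\lambda \pi_{\exp(-\gamma R)}[R'(\cdot,\tilde\theta)]\,d\gamma
+ (1-\kappa^{-1})(\kappa cx)^{-\frac{1}{\kappa-1}}\frac{\lambda^2}{2N}}
{\lambda - \frac{x\lambda^2}{2N} - \beta}.
\]

Taking for instance $\beta = \frac{\lambda}{2}$, $x = \frac{N}{2\lambda}$, and putting $b = (1-\kappa^{-1})(c\kappa)^{-\frac{1}{\kappa-1}}$, we obtain

\[
\mathbb{P}\bigl[\pi_{\exp(-\lambda r)}(R)\bigr] - \inf R
\leq \frac{4}{\lambda}\int_{\lambda/2}^{\lambda}\pi_{\exp(-\gamma R)}[R'(\cdot,\tilde\theta)]\,d\gamma
+ b\Bigl(\frac{2\lambda}{N}\Bigr)^{\frac{\kappa}{\kappa-1}}.
\]

In the parametric case when $\pi_{\exp(-\gamma R)}[R'(\cdot,\tilde\theta)] \leq \frac{d}{\gamma}$, we get

\[
\mathbb{P}\bigl[\pi_{\exp(-\lambda r)}(R)\bigr] - \inf R
\leq \frac{4\log(2)\,d}{\lambda} + b\Bigl(\frac{2\lambda}{N}\Bigr)^{\frac{\kappa}{\kappa-1}}.
\]

Taking

\[
\lambda = 2^{-1}\bigl[8\log(2)d\bigr]^{\frac{\kappa-1}{2\kappa-1}}(\kappa c)^{\frac{1}{2\kappa-1}}N^{\frac{\kappa}{2\kappa-1}},
\]

we obtain

\[
\mathbb{P}\bigl[\pi_{\exp(-\lambda r)}(R)\bigr] - \inf R
\leq (2-\kappa^{-1})(\kappa c)^{-\frac{1}{2\kappa-1}}\Bigl(\frac{8\log(2)d}{N}\Bigr)^{\frac{\kappa}{2\kappa-1}}.
\]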

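We see that this formula coincides with the result for $\kappa = 1$. We can thus reduce
the two cases to a single one and state

Corollary 1.4.7. Let us assume that for some $\tilde\theta\in\Theta$, some positive real constant
$c$, some real exponent $\kappa \geq 1$ and for any $\theta\in\Theta$, $R(\theta) \geq R(\tilde\theta) + cD(\theta,\tilde\theta)^{\kappa}$. Let us
also assume that for some positive real constant $d$ and any positive real parameter
$\gamma$, $\pi_{\exp(-\gamma R)}(R) - \inf R \leq \frac{d}{\gamma}$. Then

\[
\mathbb{P}\Bigl[\pi_{\exp\bigl\{-2^{-1}[8\log(2)d]^{\frac{\kappa-1}{2\kappa-1}}(\kappa c)^{\frac{1}{2\kappa-1}}N^{\frac{\kappa}{2\kappa-1}}\,r\bigr\}}(R)\Bigr]
\leq \inf R + (2-\kappa^{-1})(\kappa c)^{-\frac{1}{2\kappa-1}}\Bigl(\frac{8\log(2)d}{N}\Bigr)^{\frac{\kappa}{2\kappa-1}}.
\]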
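The choice of $\lambda$ made before Corollary 1.4.7 can be checked numerically: plugged into the two-term bound $\frac{4\log(2)d}{\lambda} + b(\frac{2\lambda}{N})^{\frac{\kappa}{\kappa-1}}$, it should give back the closed form of the corollary. The following is a minimal sketch in Python for $\kappa > 1$; the function names and the numerical values of $N$, $d$, $c$ and $\kappa$ are arbitrary illustrative assumptions.

```python
import math

def lambda_star(N, d, c, kappa):
    """Value of lambda chosen in the text (kappa > 1)."""
    A = 8 * math.log(2) * d
    return (0.5 * A ** ((kappa - 1) / (2 * kappa - 1))
            * (kappa * c) ** (1 / (2 * kappa - 1))
            * N ** (kappa / (2 * kappa - 1)))

def two_term_bound(lam, N, d, c, kappa):
    """4 log(2) d / lambda + b (2 lambda / N)^(kappa/(kappa-1))."""
    b = (1 - 1 / kappa) * (c * kappa) ** (-1 / (kappa - 1))
    return 4 * math.log(2) * d / lam + b * (2 * lam / N) ** (kappa / (kappa - 1))

def closed_form(N, d, c, kappa):
    """Right-hand side of Corollary 1.4.7, without the inf R term."""
    return ((2 - 1 / kappa) * (kappa * c) ** (-1 / (2 * kappa - 1))
            * (8 * math.log(2) * d / N) ** (kappa / (2 * kappa - 1)))

N, d, c, kappa = 10_000, 5.0, 0.5, 2.0   # hypothetical values
lam = lambda_star(N, d, c, kappa)
excess_two_term = two_term_bound(lam, N, d, c, kappa)
excess_closed = closed_form(N, d, c, kappa)
print(lam, excess_two_term, excess_closed)
```

With $\kappa = 2$ the excess risk bound decreases as $N^{-2/3}$, between the rates $N^{-1}$ (case $\kappa = 1$) and $N^{-1/2}$ (limit $\kappa \to \infty$).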
Let us remark that the exponent of $N$ in this corollary is known to be the minimax
exponent under these assumptions: it is unimprovable, whatever estimator is
used in place of the Gibbs posterior shown here (at least in the worst case
compatible with the hypotheses). The interest of the corollary is to show not only the
minimax exponent in $N$, but also an explicit non-asymptotic bound with reasonable
and simple constants. It is also clear that we could have got slightly better
constants if we had kept the full strength of Theorem 1.4.4 (page 38) instead of
using the weaker Corollary 1.4.6 (page 39).

In the following, we will prove empirical bounds showing how the constant $\lambda$ can
be estimated from the data instead of being chosen according to some margin and
complexity assumptions.

1.4.3. Unbiased empirical bounds

We are going to define an empirical counterpart for the expected margin function
$\varphi$. It will appear in empirical bounds having otherwise the same structure as the
non-random bound we just proved. Anyhow, we will not launch into trying to
compare the behaviour of our proposed empirical margin function with the expected
margin function, since the margin function involves taking a supremum which is
not straightforward to handle. When we touch the issue of building provably
adaptive estimators, we will instead formulate another type of bounds based on
integrated quantities, rather than try to analyse the properties of the empirical
margin function.

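Let us start as in the previous subsection with the inequality

\[
\beta\,\mathbb{P}\bigl\{\rho[R'(\cdot,\tilde\theta)]\bigr\}
\leq \mathbb{P}\bigl\{\lambda\rho[r'(\cdot,\tilde\theta)] + \mathcal{K}(\rho,\pi)\bigr\}
+ \log\bigl\{\pi\bigl[\exp\bigl\{-\lambda\Psi_{\lambda/N}[R'(\cdot,\tilde\theta),M'(\cdot,\tilde\theta)] + \beta R'(\cdot,\tilde\theta)\bigr\}\bigr]\bigr\}.
\]

We have already defined by equation (1.22, page 37) the empirical pseudo-distance

\[
m'(\theta,\tilde\theta) = \frac{1}{N}\sum_{i=1}^N \psi_i(\theta,\tilde\theta)^2.
\]

Recalling that $\mathbb{P}[m'(\theta,\tilde\theta)] = M'(\theta,\tilde\theta)$, and using the convexity of
$h \mapsto \log\{\pi[\exp(h)]\}$, we see that

\[
\beta\,\mathbb{P}\bigl\{\rho[R'(\cdot,\tilde\theta)]\bigr\}
\leq \mathbb{P}\bigl[\lambda\rho[r'(\cdot,\tilde\theta)] + \mathcal{K}(\rho,\pi)\bigr]
+ \mathbb{P}\Bigl\{\log\Bigl\{\pi\Bigl[\exp\Bigl\{-\bigl[N\sinh\bigl(\tfrac{\lambda}{N}\bigr)-\beta\bigr]r'(\cdot,\tilde\theta)
+ N\sinh\bigl(\tfrac{\lambda}{N}\bigr)\tanh\bigl(\tfrac{\lambda}{2N}\bigr)m'(\cdot,\tilde\theta)\Bigr\}\Bigr]\Bigr\}\Bigr\}.
\]

We may moreover remark that

\[
\lambda\rho[r'(\cdot,\tilde\theta)] + \mathcal{K}(\rho,\pi)
= \bigl[\beta - N\sinh\bigl(\tfrac{\lambda}{N}\bigr) + \lambda\bigr]\rho[r'(\cdot,\tilde\theta)]
+ \mathcal{K}\bigl[\rho,\pi_{\exp\{-[N\sinh(\frac{\lambda}{N})-\beta]r\}}\bigr]
- \log\Bigl\{\pi\Bigl[\exp\bigl\{-\bigl[N\sinh\bigl(\tfrac{\lambda}{N}\bigr)-\beta\bigr]r'(\cdot,\tilde\theta)\bigr\}\Bigr]\Bigr\}.
\]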

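This establishes

Theorem 1.4.8. For any positive real parameters $\beta$ and $\lambda$, for any posterior
distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

\[
\mathbb{P}\bigl\{\rho[R'(\cdot,\tilde\theta)]\bigr\}
\leq \mathbb{P}\Bigl\{\Bigl[1 - \frac{N\sinh(\frac{\lambda}{N})-\lambda}{\beta}\Bigr]\rho[r'(\cdot,\tilde\theta)]
+ \frac{\mathcal{K}\bigl[\rho,\pi_{\exp\{-[N\sinh(\frac{\lambda}{N})-\beta]r\}}\bigr]}{\beta}
\]
\[
+ \frac{1}{\beta}\log\Bigl\{\pi_{\exp\{-[N\sinh(\frac{\lambda}{N})-\beta]r\}}\Bigl[\exp\bigl[N\sinh\bigl(\tfrac{\lambda}{N}\bigr)\tanh\bigl(\tfrac{\lambda}{2N}\bigr)m'(\cdot,\tilde\theta)\bigr]\Bigr]\Bigr\}\Bigr\}.
\]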

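Taking $\beta = \frac{N}{2}\sinh(\frac{\lambda}{N})$, using the fact that $\sinh(a) \geq a$, $a \geq 0$, and expressing
$\tanh(\frac{a}{2}) = \sinh(a)^{-1}\bigl[\sqrt{1+\sinh(a)^2}-1\bigr]$ and $a = \log\bigl[\sqrt{1+\sinh(a)^2}+\sinh(a)\bigr]$, we deduce

Corollary 1.4.9. For any positive real constant $\beta$ and any posterior distribution
$\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

\[
\mathbb{P}\bigl\{\rho[R'(\cdot,\tilde\theta)]\bigr\}
\leq \mathbb{P}\Bigl\{\Bigl[\frac{N}{\beta}\log\Bigl(\sqrt{1+\tfrac{4\beta^2}{N^2}}+\tfrac{2\beta}{N}\Bigr)-1\Bigr]\rho[r'(\cdot,\tilde\theta)]
+ \frac{1}{\beta}\mathcal{K}\bigl[\rho,\pi_{\exp(-\beta r)}\bigr]
\]
\[
+ \frac{1}{\beta}\log\Bigl\{\pi_{\exp(-\beta r)}\Bigl[\exp\Bigl[N\Bigl(\sqrt{1+\tfrac{4\beta^2}{N^2}}-1\Bigr)m'(\cdot,\tilde\theta)\Bigr]\Bigr]\Bigr\}\Bigr\}.
\]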

This theorem and its corollary are really analogous to Theorem 1.4.4 (page 38),
and it could easily be proved that under Mammen and Tsybakov margin assumptions
we obtain an upper bound of the same order as Corollary 1.4.7 (page 40).
Anyhow, in order to obtain an empirical bound, we are now going to take a supremum
over all possible values of $\tilde\theta$, that is over $\Theta_1$. Although we believe that taking
this supremum will not spoil the bound in cases when over-fitting remains under
control, we will not try to investigate precisely if and when this is actually
true, and provide our empirical bound as such. Let us say only that on qualitative
grounds, the values of the margin function quantify the steepness of the contrast
function $R$ or its empirical counterpart $r$, and that the definition of the empirical
margin function is obtained by substituting $\mathbb{P}$, the true sample distribution, with
$\overline{\mathbb{P}} = \bigl(\frac{1}{N}\sum_{i=1}^N \delta_{(X_i,Y_i)}\bigr)^{\otimes N}$, the empirical sample distribution, in the definition of
the expected margin function. Therefore, on qualitative grounds, it seems hopeless
to presume that $R$ is steep when $r$ is not, or in other words that a classification
model that would be inefficient at estimating a bootstrapped sample according to
our non-random bound would be by some miracle efficient at estimating the true
sample distribution according to the same bound. To this extent, we feel that our
empirical bounds bring a satisfactory counterpart of our non-random bounds. Anyhow,
we will also produce estimators which can be proved to be adaptive using
PAC-Bayesian tools in the next section, at the price of a more sophisticated
construction involving comparisons between a posterior distribution and a Gibbs prior
distribution or between two posterior distributions.

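Let us now restrict discussion to the important case when $\tilde\theta \in \arg\min_{\Theta_1} R$.
To obtain an observable bound, let $\hat\theta \in \arg\min_{\theta\in\Theta} r(\theta)$ and let us introduce the
empirical margin functions

\[
\overline{\varphi}(x) = \sup_{\theta\in\Theta} m'(\theta,\hat\theta) - x\bigl[r(\theta)-r(\hat\theta)\bigr], \qquad x\in\mathbb{R}_+,
\]
\[
\widetilde{\varphi}(x) = \sup_{\theta\in\Theta_1} m'(\theta,\hat\theta) - x\bigl[r(\theta)-r(\hat\theta)\bigr], \qquad x\in\mathbb{R}_+.
\]

Using the fact that $m'(\theta,\tilde\theta) \leq m'(\theta,\hat\theta) + m'(\hat\theta,\tilde\theta)$, we get

Corollary 1.4.10. For any positive real parameters $\beta$ and $\lambda$, for any posterior
distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

\[
\mathbb{P}[\rho(R)] - \inf_{\Theta_1} R
\leq \mathbb{P}\Bigl\{\Bigl[1 - \frac{N\sinh(\frac{\lambda}{N})-\lambda}{\beta}\Bigr]\bigl[\rho(r)-r(\hat\theta)\bigr]
+ \frac{\mathcal{K}\bigl[\rho,\pi_{\exp\{-[N\sinh(\frac{\lambda}{N})-\beta]r\}}\bigr]}{\beta}
\]
\[
+ \beta^{-1}\log\Bigl\{\pi_{\exp\{-[N\sinh(\frac{\lambda}{N})-\beta]r\}}\Bigl[\exp\bigl[N\sinh\bigl(\tfrac{\lambda}{N}\bigr)\tanh\bigl(\tfrac{\lambda}{2N}\bigr)m'(\cdot,\hat\theta)\bigr]\Bigr]\Bigr\}
\]
\[
+ \beta^{-1}N\sinh\bigl(\tfrac{\lambda}{N}\bigr)\tanh\bigl(\tfrac{\lambda}{2N}\bigr)\,
\widetilde{\varphi}\Bigl[\frac{\beta}{N\sinh(\frac{\lambda}{N})\tanh(\frac{\lambda}{2N})}\Bigl(1-\frac{N\sinh(\frac{\lambda}{N})-\lambda}{\beta}\Bigr)\Bigr]\Bigr\}.
\]

Taking $\beta = \frac{N}{2}\sinh(\frac{\lambda}{N})$, we also obtain

\[
\mathbb{P}[\rho(R)] - \inf_{\Theta_1} R
\leq \mathbb{P}\Bigl\{\Bigl[\frac{N}{\beta}\log\Bigl(\sqrt{1+\tfrac{4\beta^2}{N^2}}+\tfrac{2\beta}{N}\Bigr)-1\Bigr]\bigl[\rho(r)-r(\hat\theta)\bigr]
+ \frac{1}{\beta}\mathcal{K}\bigl[\rho,\pi_{\exp(-\beta r)}\bigr]
\]
\[
+ \frac{1}{\beta}\log\Bigl\{\pi_{\exp(-\beta r)}\Bigl[\exp\Bigl[N\Bigl(\sqrt{1+\tfrac{4\beta^2}{N^2}}-1\Bigr)m'(\cdot,\hat\theta)\Bigr]\Bigr]\Bigr\}
\]
\[
+ \frac{N}{\beta}\Bigl(\sqrt{1+\tfrac{4\beta^2}{N^2}}-1\Bigr)\,
\widetilde{\varphi}\Biggl[\frac{\log\bigl(\sqrt{1+\frac{4\beta^2}{N^2}}+\frac{2\beta}{N}\bigr)-\frac{\beta}{N}}{\sqrt{1+\frac{4\beta^2}{N^2}}-1}\Biggr]\Bigr\}.
\]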

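Note that we could also use the upper bound $m'(\theta,\hat\theta) \leq x\bigl[r(\theta)-r(\hat\theta)\bigr] + \overline{\varphi}(x)$
and put $\alpha = N\sinh(\frac{\lambda}{N})\bigl[1-x\tanh(\frac{\lambda}{2N})\bigr] - \beta$, to obtain

Corollary 1.4.11. For any non-negative real parameters $x$, $\alpha$ and $\lambda$, such that
$\alpha < N\sinh(\frac{\lambda}{N})\bigl[1-x\tanh(\frac{\lambda}{2N})\bigr]$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

\[
\mathbb{P}[\rho(R)] - \inf_{\Theta_1} R
\leq \mathbb{P}\Biggl\{\Biggl[1 - \frac{N\sinh(\frac{\lambda}{N})\bigl[1-x\tanh(\frac{\lambda}{2N})\bigr]-\lambda}
{N\sinh(\frac{\lambda}{N})\bigl[1-x\tanh(\frac{\lambda}{2N})\bigr]-\alpha}\Biggr]\bigl[\rho(r)-r(\hat\theta)\bigr]
+ \frac{\mathcal{K}\bigl[\rho,\pi_{\exp(-\alpha r)}\bigr]}{N\sinh(\frac{\lambda}{N})\bigl[1-x\tanh(\frac{\lambda}{2N})\bigr]-\alpha}
\]
\[
+ \frac{N\sinh(\frac{\lambda}{N})\tanh(\frac{\lambda}{2N})}{N\sinh(\frac{\lambda}{N})\bigl[1-x\tanh(\frac{\lambda}{2N})\bigr]-\alpha}
\Bigl[\overline{\varphi}(x) + \widetilde{\varphi}\Bigl(\frac{\lambda-\alpha}{N\sinh(\frac{\lambda}{N})\tanh(\frac{\lambda}{2N})} - x\Bigr)\Bigr]\Biggr\}.
\]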

Let us notice that in the case when $\Theta_1 = \Theta$, the upper bound provided by this
corollary has the same general form as the upper bound provided by Corollary 1.4.5
(page 39), with the sample distribution $\mathbb{P}$ replaced with the empirical distribution
of the sample $\overline{\mathbb{P}} = \bigl(\frac{1}{N}\sum_{i=1}^N \delta_{(X_i,Y_i)}\bigr)^{\otimes N}$. Therefore, our empirical bound can be of
a larger order of magnitude than our non-random bound only in the case when our
non-random bound applied to the bootstrapped sample distribution $\overline{\mathbb{P}}$ would be of
a larger order of magnitude than when applied to the true sample distribution $\mathbb{P}$. In
other words, we can say that our empirical bound is close to our non-random bound
in every situation where the bootstrapped sample distribution $\overline{\mathbb{P}}$ is not harder to
bound than the true sample distribution $\mathbb{P}$. Although this does not prove that our
empirical bound is always of the same order as our non-random bound, this is a good
qualitative hint that this will be the case in most practical situations of interest,
since in situations of "under-fitting", if they exist, it is likely that the choice of the
classification model is inappropriate to the data and should be modified.

Another reassuring remark is that the empirical margin functions $\overline{\varphi}$ and $\widetilde{\varphi}$ behave
well in the case when $\inf_\Theta r = 0$. Indeed in this case $m'(\theta,\hat\theta) = r'(\theta,\hat\theta) = r(\theta)$,
$\theta\in\Theta$, and thus $\overline{\varphi}(1) = \widetilde{\varphi}(1) = 0$, and

\[
\widetilde{\varphi}(x) \leq -(x-1)\inf_{\Theta_1} r, \qquad x \geq 1.
\]

This shows that in this case we recover the same accuracy as with non-relative local
empirical bounds. Thus the bound of Corollary 1.4.11 does not collapse in presence
of massive over-fitting in the larger model, causing $r(\hat\theta) = 0$, which is another hint
that this may be an accurate bound in many situations.
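The behaviour of the empirical margin functions in the case $\inf_\Theta r = 0$ can be illustrated on a toy finite model. The sketch below is only an illustration under stated assumptions: the parameter set is a hypothetical list of three classifiers given by their predictions on six patterns, and the function name `empirical_margin` is ours.

```python
def empirical_margin(preds, labels, theta_hat, x, subset=None):
    """sup over theta of m'(theta, theta_hat) - x * [r(theta) - r(theta_hat)],
    the supremum being taken over `subset` (all indices when None)."""
    n = len(labels)

    def r(t):  # empirical error rate of classifier t
        return sum(p != y for p, y in zip(preds[t], labels)) / n

    def m(t, s):  # empirical pseudo distance m'(t, s)
        return sum(p != q for p, q in zip(preds[t], preds[s])) / n

    idx = range(len(preds)) if subset is None else subset
    return max(m(t, theta_hat) - x * (r(t) - r(theta_hat)) for t in idx)

labels = [0, 1, 1, 0, 1, 0]
preds = [
    [0, 1, 1, 0, 1, 0],   # classifier 0 fits the sample perfectly: r = 0
    [1, 0, 1, 0, 1, 0],   # classifier 1 makes two errors
    [1, 0, 0, 0, 1, 0],   # classifier 2 makes three errors
]
theta_hat = 0                                   # empirical risk minimizer
phi_bar_1 = empirical_margin(preds, labels, theta_hat, x=1.0)
phi_tilde_2 = empirical_margin(preds, labels, theta_hat, x=2.0, subset=[1, 2])
print(phi_bar_1, phi_tilde_2)
```

Here the margin function over the whole model vanishes at $x = 1$, and the one restricted to the sub-model $\{1,2\}$ is equal to $-(x-1)\inf_{\Theta_1} r$ at $x = 2$, in accordance with the remark above.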



1.4.4. Relative empirical deviation bounds



It is natural to make use of Theorem 1.4.3 (page 37) to obtain empirical deviation
bounds, since this theorem provides an empirical variance term.

Theorem 1.4.3 is written in a way which exploits the fact that $\psi_i$ takes only the
three values $-1$, $0$ and $+1$. However, it will be more convenient for the following
computations to use it in its more general form, which only makes use of the fact
that $\psi_i \in (-1,1)$. With notation to be explained hereafter, it can indeed also be
written as

\[
\mathbb{P}\Bigl\{\exp\Bigl[\sup_{\rho\in\mathcal{M}_+^1(\Theta)}
-N\rho\bigl\{\log\bigl[1-\lambda P(\psi)\bigr]\bigr\}
+ N\rho\bigl\{\overline{P}\bigl[\log(1-\lambda\psi)\bigr]\bigr\}
- \mathcal{K}(\rho,\pi)\Bigr]\Bigr\} \leq 1. \tag{1.25}
\]
< 1. 
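We have used the following notation in this inequality. We have put

\[
\overline{P} = \frac{1}{N}\sum_{i=1}^N \delta_{(X_i,Y_i)},
\]

so that $\overline{P}$ is our notation for the empirical distribution of the process
$(X_i,Y_i)_{i=1}^N$. Moreover we have also used

\[
P = \frac{1}{N}\sum_{i=1}^N P_i,
\]

where it should be remembered that the joint distribution of the process $(X_i,Y_i)_{i=1}^N$
is $\mathbb{P} = \bigotimes_{i=1}^N P_i$. We have considered $\psi(\theta,\tilde\theta)$ as a function defined on $\mathcal{X}\times\mathcal{Y}$ as
$\psi(\theta,\tilde\theta)(x,y) = \mathbb{1}[y \neq f_\theta(x)] - \mathbb{1}[y \neq f_{\tilde\theta}(x)]$, $(x,y)\in\mathcal{X}\times\mathcal{Y}$, so that it should be
understood that

\[
P(\psi) = \frac{1}{N}\sum_{i=1}^N \mathbb{P}\bigl[\psi_i(\theta,\tilde\theta)\bigr]
= \frac{1}{N}\sum_{i=1}^N \mathbb{P}\bigl\{\mathbb{1}[Y_i\neq f_\theta(X_i)] - \mathbb{1}[Y_i\neq f_{\tilde\theta}(X_i)]\bigr\}
= R'(\theta,\tilde\theta).
\]

In the same way

\[
\overline{P}\bigl[\log(1-\lambda\psi)\bigr] = \frac{1}{N}\sum_{i=1}^N \log\bigl[1-\lambda\psi_i(\theta,\tilde\theta)\bigr].
\]

Moreover integration with respect to $\rho$ bears on the index $\theta$, so that

\[
\rho\bigl\{\log[1-\lambda P(\psi)]\bigr\}
= \int_{\theta\in\Theta}\log\Bigl\{1-\frac{\lambda}{N}\sum_{i=1}^N \mathbb{P}\bigl[\psi_i(\theta,\tilde\theta)\bigr]\Bigr\}\rho(d\theta),
\]
\[
\rho\bigl\{\overline{P}[\log(1-\lambda\psi)]\bigr\}
= \int_{\theta\in\Theta}\frac{1}{N}\sum_{i=1}^N \log\bigl[1-\lambda\psi_i(\theta,\tilde\theta)\bigr]\rho(d\theta).
\]

We have chosen concise notation, as we did throughout these notes, in order to
make the computations easier to follow.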

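To get an alternate version of empirical relative deviation bounds, we need to find
some convenient way to localize the choice of the prior distribution $\pi$ in equation
(1.25, page 44). Here we propose replacing it with $\mu = \pi_{\exp\{-N\log[1+\beta P(\psi)]\}}$, which
can also be written $\pi_{\exp\{-N\log[1+\beta R'(\cdot,\tilde\theta)]\}}$. Indeed we see that

\[
\mathcal{K}(\rho,\mu) = N\rho\bigl\{\log[1+\beta P(\psi)]\bigr\} + \mathcal{K}(\rho,\pi)
+ \log\bigl\{\pi\bigl[\exp\{-N\log[1+\beta P(\psi)]\}\bigr]\bigr\}.
\]

Moreover, we deduce from our deviation inequality applied to $-\psi$, that (as long as
$\beta > -1$),

\[
\mathbb{P}\Bigl\{\exp\Bigl[N\mu\bigl\{\overline{P}[\log(1+\beta\psi)]\bigr\} - N\mu\bigl\{\log[1+\beta P(\psi)]\bigr\}\Bigr]\Bigr\} \leq 1.
\]

Thus

\[
\mathbb{P}\Bigl\{\exp\Bigl[\log\bigl\{\pi\bigl[\exp\{-N\log[1+\beta P(\psi)]\}\bigr]\bigr\}
- \log\bigl\{\pi\bigl[\exp\{-N\overline{P}[\log(1+\beta\psi)]\}\bigr]\bigr\}\Bigr]\Bigr\}
\]
\[
= \mathbb{P}\Bigl\{\exp\Bigl[-\log\bigl\{\pi\bigl[\exp\{-N\overline{P}[\log(1+\beta\psi)]\}\bigr]\bigr\}
- N\mu\bigl\{\log[1+\beta P(\psi)]\bigr\} - \mathcal{K}(\mu,\pi)\Bigr]\Bigr\}
\]
\[
\leq \mathbb{P}\Bigl\{\exp\Bigl[N\mu\bigl\{\overline{P}[\log(1+\beta\psi)]\bigr\} - N\mu\bigl\{\log[1+\beta P(\psi)]\bigr\}\Bigr]\Bigr\} \leq 1.
\]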



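This can be used to handle $\mathcal{K}(\rho,\mu)$, making use of the Cauchy-Schwarz inequality
as follows

\[
\mathbb{P}\Bigl\{\exp\Bigl[\frac{1}{2}\Bigl(
-N\log\bigl\{\bigl(1-\lambda\rho[P(\psi)]\bigr)\bigl(1+\beta\rho[P(\psi)]\bigr)\bigr\}
+ N\rho\bigl\{\overline{P}[\log(1-\lambda\psi)]\bigr\}
- \mathcal{K}(\rho,\pi) - \log\bigl\{\pi\bigl[\exp\{-N\overline{P}[\log(1+\beta\psi)]\}\bigr]\bigr\}\Bigr)\Bigr]\Bigr\}
\]
\[
\leq \mathbb{P}\Bigl\{\exp\Bigl[-N\log\bigl\{1-\lambda\rho[P(\psi)]\bigr\}
+ N\rho\bigl\{\overline{P}[\log(1-\lambda\psi)]\bigr\} - \mathcal{K}(\rho,\mu)\Bigr]\Bigr\}^{1/2}
\]
\[
\times\,\mathbb{P}\Bigl\{\exp\Bigl[\log\bigl\{\pi\bigl[\exp\{-N\log[1+\beta P(\psi)]\}\bigr]\bigr\}
- \log\bigl\{\pi\bigl[\exp\{-N\overline{P}[\log(1+\beta\psi)]\}\bigr]\bigr\}\Bigr]\Bigr\}^{1/2} \leq 1.
\]

This implies that with $\mathbb{P}$ probability at least $1-\epsilon$,

\[
-N\log\bigl\{\bigl(1-\lambda\rho[P(\psi)]\bigr)\bigl(1+\beta\rho[P(\psi)]\bigr)\bigr\}
\leq -N\rho\bigl\{\overline{P}[\log(1-\lambda\psi)]\bigr\}
+ \mathcal{K}(\rho,\pi) + \log\bigl\{\pi\bigl[\exp\{-N\overline{P}[\log(1+\beta\psi)]\}\bigr]\bigr\} - 2\log(\epsilon).
\]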



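It is now convenient to remember that

\[
\overline{P}\bigl[\log(1-\lambda\psi)\bigr]
= \frac{1}{2}\log\Bigl(\frac{1-\lambda}{1+\lambda}\Bigr)r'(\theta,\tilde\theta)
+ \frac{1}{2}\log(1-\lambda^2)\,m'(\theta,\tilde\theta).
\]

We thus can write the previous inequality as

\[
-N\log\bigl\{\bigl(1-\lambda\rho[R'(\cdot,\tilde\theta)]\bigr)\bigl(1+\beta\rho[R'(\cdot,\tilde\theta)]\bigr)\bigr\}
\leq \frac{N}{2}\log\Bigl(\frac{1+\lambda}{1-\lambda}\Bigr)\rho[r'(\cdot,\tilde\theta)]
- \frac{N}{2}\log(1-\lambda^2)\rho[m'(\cdot,\tilde\theta)] + \mathcal{K}(\rho,\pi)
\]
\[
+ \log\Bigl\{\pi\Bigl[\exp\Bigl\{-\frac{N}{2}\log\Bigl(\frac{1+\beta}{1-\beta}\Bigr)r'(\cdot,\tilde\theta)
- \frac{N}{2}\log(1-\beta^2)m'(\cdot,\tilde\theta)\Bigr\}\Bigr]\Bigr\} - 2\log(\epsilon).
\]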



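Let us assume now that $\tilde\theta \in \arg\min_{\Theta_1} R$. Let us introduce $\hat\theta \in \arg\min_\Theta r$.
Decomposing $r'(\theta,\tilde\theta) = r'(\theta,\hat\theta) + r'(\hat\theta,\tilde\theta)$ and considering that

\[
m'(\theta,\tilde\theta) \leq m'(\theta,\hat\theta) + m'(\hat\theta,\tilde\theta),
\]

we see that with $\mathbb{P}$ probability at least $1-\epsilon$, for any posterior distribution
$\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

\[
-N\log\bigl\{\bigl(1-\lambda\rho[R'(\cdot,\tilde\theta)]\bigr)\bigl(1+\beta\rho[R'(\cdot,\tilde\theta)]\bigr)\bigr\}
\leq \frac{N}{2}\log\Bigl(\frac{1+\lambda}{1-\lambda}\Bigr)\rho[r'(\cdot,\hat\theta)]
- \frac{N}{2}\log(1-\lambda^2)\rho[m'(\cdot,\hat\theta)] + \mathcal{K}(\rho,\pi)
\]
\[
+ \log\Bigl\{\pi\Bigl[\exp\Bigl\{-\frac{N}{2}\log\Bigl(\frac{1+\beta}{1-\beta}\Bigr)r'(\cdot,\hat\theta)
- \frac{N}{2}\log(1-\beta^2)m'(\cdot,\hat\theta)\Bigr\}\Bigr]\Bigr\}
\]
\[
+ \frac{N}{2}\log\Bigl[\frac{(1+\lambda)(1-\beta)}{(1-\lambda)(1+\beta)}\Bigr]r'(\hat\theta,\tilde\theta)
- \frac{N}{2}\log\bigl[(1-\lambda^2)(1-\beta^2)\bigr]m'(\hat\theta,\tilde\theta) - 2\log(\epsilon).
\]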
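Let us now define for simplicity the posterior $\nu : \Omega \to \mathcal{M}_+^1(\Theta)$ by the identity

\[
\frac{d\nu}{d\pi}(\theta) =
\frac{\exp\bigl\{-\frac{N}{2}\log\bigl(\frac{1+\lambda}{1-\lambda}\bigr)r'(\theta,\hat\theta) + \frac{N}{2}\log(1-\lambda^2)m'(\theta,\hat\theta)\bigr\}}
{\pi\bigl[\exp\bigl\{-\frac{N}{2}\log\bigl(\frac{1+\lambda}{1-\lambda}\bigr)r'(\cdot,\hat\theta) + \frac{N}{2}\log(1-\lambda^2)m'(\cdot,\hat\theta)\bigr\}\bigr]}.
\]

Let us also introduce the random bound

\[
B = \frac{1}{N}\log\Bigl\{\nu\Bigl[\exp\Bigl[\frac{N}{2}\log\Bigl[\frac{(1+\lambda)(1-\beta)}{(1-\lambda)(1+\beta)}\Bigr]r'(\cdot,\hat\theta)
- \frac{N}{2}\log\bigl[(1-\lambda^2)(1-\beta^2)\bigr]m'(\cdot,\hat\theta)\Bigr]\Bigr]\Bigr\}
\]
\[
+ \sup_{\theta\in\Theta_1}\Bigl\{\frac{1}{2}\log\Bigl[\frac{(1-\lambda)(1+\beta)}{(1+\lambda)(1-\beta)}\Bigr]r'(\theta,\hat\theta)
- \frac{1}{2}\log\bigl[(1-\lambda^2)(1-\beta^2)\bigr]m'(\theta,\hat\theta)\Bigr\}
- \frac{2}{N}\log(\epsilon).
\]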

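Theorem 1.4.12. Using the above notation, for any real constants $0 < \beta < \lambda < 1$,
for any prior distribution $\pi\in\mathcal{M}_+^1(\Theta)$, for any subset $\Theta_1\subset\Theta$, with $\mathbb{P}$ probability at
least $1-\epsilon$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

\[
-\log\Bigl\{\bigl(1-\lambda\bigl[\rho(R)-\inf_{\Theta_1}R\bigr]\bigr)\bigl(1+\beta\bigl[\rho(R)-\inf_{\Theta_1}R\bigr]\bigr)\Bigr\}
\leq \frac{\mathcal{K}(\rho,\nu)}{N} + B.
\]

Therefore,

\[
\rho(R) - \inf_{\Theta_1}R
\leq \frac{\lambda-\beta}{2\lambda\beta}\Biggl(\sqrt{1+\frac{4\lambda\beta}{(\lambda-\beta)^2}\Bigl[1-\exp\Bigl(-B-\frac{\mathcal{K}(\rho,\nu)}{N}\Bigr)\Bigr]}-1\Biggr)
\leq \frac{1}{\lambda-\beta}\Bigl(B+\frac{\mathcal{K}(\rho,\nu)}{N}\Bigr).
\]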

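Let us define the posterior $\widetilde{\nu}$ by the identity

\[
\frac{d\widetilde{\nu}}{d\pi}(\theta) =
\frac{\exp\bigl[-\frac{N}{2}\log\bigl(\frac{1+\beta}{1-\beta}\bigr)r'(\theta,\hat\theta) - \frac{N}{2}\log(1-\beta^2)m'(\theta,\hat\theta)\bigr]}
{\pi\bigl\{\exp\bigl[-\frac{N}{2}\log\bigl(\frac{1+\beta}{1-\beta}\bigr)r'(\cdot,\hat\theta) - \frac{N}{2}\log(1-\beta^2)m'(\cdot,\hat\theta)\bigr]\bigr\}}.
\]

It is useful to remark that

\[
\frac{1}{N}\log\Bigl\{\nu\Bigl[\exp\Bigl(\frac{N}{2}\log\Bigl[\frac{(1+\lambda)(1-\beta)}{(1-\lambda)(1+\beta)}\Bigr]r'(\cdot,\hat\theta)
- \frac{N}{2}\log\bigl[(1-\lambda^2)(1-\beta^2)\bigr]m'(\cdot,\hat\theta)\Bigr)\Bigr]\Bigr\}
\]
\[
\leq \widetilde{\nu}\Bigl\{\frac{1}{2}\log\Bigl[\frac{(1+\lambda)(1-\beta)}{(1-\lambda)(1+\beta)}\Bigr]r'(\cdot,\hat\theta)
- \frac{1}{2}\log\bigl[(1-\lambda^2)(1-\beta^2)\bigr]m'(\cdot,\hat\theta)\Bigr\}.
\]

This inequality is a special case of

\[
\log\bigl\{\pi[\exp(g)]\bigr\} - \log\bigl\{\pi[\exp(h)]\bigr\}
= \int_{\alpha=0}^{1}\pi_{\exp[h+\alpha(g-h)]}(g-h)\,d\alpha \leq \pi_{\exp(g)}(g-h),
\]

which is a consequence of the convexity of $\alpha \mapsto \log\bigl\{\pi\bigl[\exp[h+\alpha(g-h)]\bigr]\bigr\}$.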

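Let us introduce as previously $\overline{\varphi}(x) = \sup_{\theta\in\Theta} m'(\theta,\hat\theta) - x\,r'(\theta,\hat\theta)$, $x\in\mathbb{R}_+$. Let
us moreover consider $\widetilde{\varphi}(x) = \sup_{\theta\in\Theta_1} m'(\theta,\hat\theta) - x\,r'(\theta,\hat\theta)$, $x\in\mathbb{R}_+$. These functions
can be used to produce a result which is slightly weaker, but maybe easier to read
and understand. Indeed, we see that, for any $x\in\mathbb{R}_+$, with $\mathbb{P}$ probability at least
$1-\epsilon$, for any posterior distribution $\rho$,

\[
-N\log\bigl\{\bigl(1-\lambda\rho[R'(\cdot,\tilde\theta)]\bigr)\bigl(1+\beta\rho[R'(\cdot,\tilde\theta)]\bigr)\bigr\}
\]
\[
\leq \frac{N}{2}\log\Bigl[\frac{(1+\lambda)}{(1-\lambda)(1-\lambda^2)^{x}}\Bigr]\rho[r'(\cdot,\hat\theta)]
- \frac{N}{2}\log\bigl[(1-\lambda^2)(1-\beta^2)\bigr]\overline{\varphi}(x) + \mathcal{K}(\rho,\pi)
\]
\[
+ \log\Bigl\{\pi\Bigl[\exp\Bigl\{-\frac{N}{2}\log\Bigl[\frac{(1+\beta)}{(1-\beta)(1-\beta^2)^{-x}}\Bigr]r'(\cdot,\hat\theta)\Bigr\}\Bigr]\Bigr\}
\]
\[
- \frac{N}{2}\log\bigl[(1-\lambda^2)(1-\beta^2)\bigr]\,
\widetilde{\varphi}\Biggl(\frac{\log\bigl[\frac{(1+\lambda)(1-\beta)}{(1-\lambda)(1+\beta)}\bigr]}{-\log[(1-\lambda^2)(1-\beta^2)]}\Biggr) - 2\log(\epsilon)
\]
\[
= \int_{\frac{N}{2}\log\bigl[\frac{(1+\beta)}{(1-\beta)(1-\beta^2)^{-x}}\bigr]}^{\frac{N}{2}\log\bigl[\frac{(1+\lambda)}{(1-\lambda)(1-\lambda^2)^{x}}\bigr]}
\pi_{\exp(-\alpha r)}[r'(\cdot,\hat\theta)]\,d\alpha
+ \mathcal{K}\Bigl(\rho,\pi_{\exp\bigl\{-\frac{N}{2}\log\bigl[\frac{(1+\lambda)}{(1-\lambda)(1-\lambda^2)^{x}}\bigr]r\bigr\}}\Bigr) - 2\log(\epsilon)
\]
\[
- \frac{N}{2}\log\bigl[(1-\lambda^2)(1-\beta^2)\bigr]
\Biggl[\overline{\varphi}(x) + \widetilde{\varphi}\Biggl(\frac{\log\bigl[\frac{(1+\lambda)(1-\beta)}{(1-\lambda)(1+\beta)}\bigr]}{-\log[(1-\lambda^2)(1-\beta^2)]}\Biggr)\Biggr].
\]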

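Theorem 1.4.13. With the previous notation, for any real constants $0 < \beta < \lambda < 1$,
for any positive real constant $x$, for any prior probability distribution $\pi\in\mathcal{M}_+^1(\Theta)$,
for any subset $\Theta_1\subset\Theta$, with $\mathbb{P}$ probability at least $1-\epsilon$, for any posterior distribution
$\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, putting

\[
B(\rho) = \frac{1}{N(\lambda-\beta)}
\int_{\frac{N}{2}\log\bigl[\frac{(1+\beta)}{(1-\beta)(1-\beta^2)^{-x}}\bigr]}^{\frac{N}{2}\log\bigl[\frac{(1+\lambda)}{(1-\lambda)(1-\lambda^2)^{x}}\bigr]}
\pi_{\exp(-\alpha r)}[r'(\cdot,\hat\theta)]\,d\alpha
+ \frac{\mathcal{K}\Bigl(\rho,\pi_{\exp\bigl\{-\frac{N}{2}\log\bigl[\frac{(1+\lambda)}{(1-\lambda)(1-\lambda^2)^{x}}\bigr]r\bigr\}}\Bigr) - 2\log(\epsilon)}{N(\lambda-\beta)}
\]
\[
- \frac{1}{2(\lambda-\beta)}\log\bigl[(1-\lambda^2)(1-\beta^2)\bigr]
\Biggl[\overline{\varphi}(x) + \widetilde{\varphi}\Biggl(\frac{\log\bigl[\frac{(1+\lambda)(1-\beta)}{(1-\lambda)(1+\beta)}\bigr]}{-\log[(1-\lambda^2)(1-\beta^2)]}\Biggr)\Biggr]
\]
\[
\leq \frac{d_e}{N(\lambda-\beta)}
\log\Biggl(\frac{\log\bigl[\frac{(1+\lambda)}{(1-\lambda)(1-\lambda^2)^{x}}\bigr]}{\log\bigl[\frac{(1+\beta)}{(1-\beta)(1-\beta^2)^{-x}}\bigr]}\Biggr)
+ \frac{\mathcal{K}\Bigl(\rho,\pi_{\exp\bigl\{-\frac{N}{2}\log\bigl[\frac{(1+\lambda)}{(1-\lambda)(1-\lambda^2)^{x}}\bigr]r\bigr\}}\Bigr) - 2\log(\epsilon)}{N(\lambda-\beta)}
\]
\[
- \frac{1}{2(\lambda-\beta)}\log\bigl[(1-\lambda^2)(1-\beta^2)\bigr]
\Biggl[\overline{\varphi}(x) + \widetilde{\varphi}\Biggl(\frac{\log\bigl[\frac{(1+\lambda)(1-\beta)}{(1-\lambda)(1+\beta)}\bigr]}{-\log[(1-\lambda^2)(1-\beta^2)]}\Biggr)\Biggr],
\]

the following bounds hold true:

\[
\rho(R) - \inf_{\Theta_1}R
\leq \frac{\lambda-\beta}{2\lambda\beta}\Biggl(\sqrt{1+\frac{4\lambda\beta}{(\lambda-\beta)^2}\bigl\{1-\exp[-(\lambda-\beta)B(\rho)]\bigr\}}-1\Biggr)
\leq B(\rho).
\]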

Let us remark that this alternative way of handling relative deviation bounds
made it possible to carry on with non-linear bounds up to the final result. For
instance, if $\lambda = 0.5$, $\beta = 0.2$ and $B(\rho) = 0.1$, the non-linear bound gives
$\rho(R) - \inf_{\Theta_1} R \leq 0.096$.
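The numerical example above can be reproduced with a few lines of code. This is only a sketch evaluating the non-linear bound of Theorem 1.4.13; the function name `nonlinear_bound` is ours.

```python
import math

def nonlinear_bound(lam, beta, B):
    """(lam - beta)/(2 lam beta) *
    (sqrt(1 + 4 lam beta/(lam - beta)^2 * (1 - exp(-(lam - beta) B))) - 1)."""
    gap = lam - beta
    inner = 1.0 + 4.0 * lam * beta / gap ** 2 * (1.0 - math.exp(-gap * B))
    return gap / (2.0 * lam * beta) * (math.sqrt(inner) - 1.0)

value = nonlinear_bound(lam=0.5, beta=0.2, B=0.1)
print(value)
```

The value obtained is slightly below $0.096$, to be compared with the linear bound $B(\rho) = 0.1$.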
Chapter 2



Comparing posterior 
distributions to Gibbs priors 

2.1. Bounds relative to a Gibbs distribution 

We now come to an approach to relative bounds whose performance can be analysed 
with PAC-Bayesian tools. 

The empirical bounds at the end of the previous chapter involve taking suprema
in $\theta\in\Theta$, and replacing the expected margin function $\varphi$ with some empirical
counterparts $\overline{\varphi}$ or $\widetilde{\varphi}$, which may prove unsafe when using very complex classification
models.

We are now going to focus on the control of the divergence $\mathcal{K}[\rho,\pi_{\exp(-\beta R)}]$. It
is already obvious, we hope, that controlling this divergence is the crux of the
matter, and that it is a way to upper bound the mutual information between
the training sample and the parameter, which can be expressed as
$\mathbb{P}\bigl\{\mathcal{K}[\rho,\mathbb{P}(\rho)]\bigr\} = \mathbb{P}\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\beta R)}]\bigr\} - \mathcal{K}[\mathbb{P}(\rho),\pi_{\exp(-\beta R)}]$, as explained previously.

Through the identity

\[
\mathcal{K}[\rho,\pi_{\exp(-\beta R)}] = \beta\bigl[\rho(R)-\pi_{\exp(-\beta R)}(R)\bigr]
+ \mathcal{K}(\rho,\pi) - \mathcal{K}[\pi_{\exp(-\beta R)},\pi], \tag{2.1}
\]

we see that the control of this divergence is related to the control of the difference
$\rho(R) - \pi_{\exp(-\beta R)}(R)$. This is the route we will follow first.
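Identity (2.1) is purely algebraic and can be checked on a finite parameter set. The sketch below assumes a hypothetical four-point parameter set with arbitrary values for the prior, the risk $R$ and the posterior $\rho$; the helper names `kl` and `gibbs` are ours.

```python
import math

def kl(p, q):
    """Kullback divergence K(p, q) between two distributions on a finite set."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def gibbs(prior, R, beta):
    """Gibbs distribution pi_{exp(-beta R)} on a finite set."""
    w = [p * math.exp(-beta * x) for p, x in zip(prior, R)]
    z = sum(w)
    return [x / z for x in w]

def mean(mu, f):
    return sum(m * x for m, x in zip(mu, f))

prior = [0.4, 0.3, 0.2, 0.1]       # hypothetical prior pi
R = [0.35, 0.10, 0.25, 0.40]       # hypothetical expected error rates
rho = [0.1, 0.6, 0.2, 0.1]         # hypothetical posterior rho
beta = 7.0
g = gibbs(prior, R, beta)
lhs = kl(rho, g)
rhs = beta * (mean(rho, R) - mean(g, R)) + kl(rho, prior) - kl(g, prior)
print(lhs, rhs)
```

Both sides should coincide up to rounding errors, for any choice of the prior, of $\rho$ and of $\beta$.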

Thus comparing any posterior distribution with a Gibbs prior distribution will 
provide a first way to build an estimator which can be proved to reach adaptively 
the best possible asymptotic error rate under Mammen and Tsybakov margin as- 
sumptions and parametric complexity assumptions (at least as long as orders of 
magnitude are concerned, we will not discuss the question of asymptotically opti- 
mal constants). 

Then we will provide an empirical bound for the Kullback divergence
$\mathcal{K}[\rho,\pi_{\exp(-\beta R)}]$ itself. This will serve to address the question of model selection, which
will be achieved by comparing the performance of two posterior distributions possibly
supported by two different models. This will also provide a second way to build
estimators which can be proved to be adaptive under Mammen and Tsybakov margin
assumptions and parametric complexity assumptions (somewhat weaker than
with the first method).

Finally, we will present two-step localization strategies, in which the performance
of the posterior distribution to be analysed is compared with a two-step Gibbs prior.



2.1.1. Comparing a posterior distribution with a Gibbs prior 



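Similarly to Theorem 1.4.3 (page 37) we can prove that for any prior distribution
$\pi\in\mathcal{M}_+^1(\Theta)$,

\[
\mathbb{P}\Bigl\{\pi\otimes\pi\Bigl[\exp\Bigl\{-N\log\bigl(1-\tanh\bigl(\tfrac{\gamma}{N}\bigr)R'\bigr)
- \gamma r' - N\log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]m'\Bigr\}\Bigr]\Bigr\} \leq 1. \tag{2.2}
\]

Replacing $\pi$ with $\pi_{\exp(-\beta R)}$ and considering the posterior distribution $\rho\otimes\pi_{\exp(-\beta R)}$
provides a starting point in the comparison of $\rho$ with $\pi_{\exp(-\beta R)}$; we can indeed state
with $\mathbb{P}$ probability at least $1-\epsilon$ that

\[
-N\log\Bigl\{1-\tanh\bigl(\tfrac{\gamma}{N}\bigr)\bigl[\rho(R)-\pi_{\exp(-\beta R)}(R)\bigr]\Bigr\}
\leq \gamma\bigl[\rho(r)-\pi_{\exp(-\beta R)}(r)\bigr]
+ N\log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]\bigl[\rho\otimes\pi_{\exp(-\beta R)}\bigr](m')
+ \mathcal{K}\bigl[\rho,\pi_{\exp(-\beta R)}\bigr] - \log(\epsilon). \tag{2.3}
\]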

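Using equation (2.1, page 51) to handle the entropy term, we get

\[
-N\log\Bigl\{1-\tanh\bigl(\tfrac{\gamma}{N}\bigr)\bigl[\rho(R)-\pi_{\exp(-\beta R)}(R)\bigr]\Bigr\}
- \beta\bigl[\rho(R)-\pi_{\exp(-\beta R)}(R)\bigr]
\]
\[
\leq \gamma\bigl[\rho(r)-\pi_{\exp(-\beta R)}(r)\bigr]
+ N\log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]\rho\otimes\pi_{\exp(-\beta R)}(m')
+ \mathcal{K}(\rho,\pi) - \mathcal{K}\bigl[\pi_{\exp(-\beta R)},\pi\bigr] - \log(\epsilon). \tag{2.4}
\]

We can then decompose in the right-hand side $\gamma\bigl[\rho(r)-\pi_{\exp(-\beta R)}(r)\bigr]$ into
$(\gamma-\lambda)\bigl[\rho(r)-\pi_{\exp(-\beta R)}(r)\bigr] + \lambda\bigl[\rho(r)-\pi_{\exp(-\beta R)}(r)\bigr]$ for some parameter $\lambda$ to be set
later on and use the fact that

\[
\lambda\bigl[\rho(r)-\pi_{\exp(-\beta R)}(r)\bigr]
+ N\log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]\rho\otimes\pi_{\exp(-\beta R)}(m')
+ \mathcal{K}(\rho,\pi) - \mathcal{K}\bigl[\pi_{\exp(-\beta R)},\pi\bigr]
\]
\[
\leq \lambda\rho(r) + \mathcal{K}(\rho,\pi)
+ \log\Bigl\{\pi\Bigl[\exp\bigl\{-\lambda r + N\log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]\rho(m')\bigr\}\Bigr]\Bigr\}
\]
\[
= \mathcal{K}\bigl[\rho,\pi_{\exp(-\lambda r)}\bigr]
+ \log\Bigl\{\pi_{\exp(-\lambda r)}\Bigl[\exp\bigl\{N\log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]\rho(m')\bigr\}\Bigr]\Bigr\}
\]

to get rid of the appearance of the unobserved Gibbs prior $\pi_{\exp(-\beta R)}$ in most places
of the right-hand side of our inequality, leading to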

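Theorem 2.1.1. For any real constants $\beta$ and $\gamma$, with $\mathbb{P}$ probability at least $1-\epsilon$,
for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, for any real constant $\lambda$,

\[
\bigl[N\tanh\bigl(\tfrac{\gamma}{N}\bigr)-\beta\bigr]\bigl[\rho(R)-\pi_{\exp(-\beta R)}(R)\bigr]
\]
\[
\leq -N\log\Bigl\{1-\tanh\bigl(\tfrac{\gamma}{N}\bigr)\bigl[\rho(R)-\pi_{\exp(-\beta R)}(R)\bigr]\Bigr\}
- \beta\bigl[\rho(R)-\pi_{\exp(-\beta R)}(R)\bigr]
\]
\[
\leq (\gamma-\lambda)\bigl[\rho(r)-\pi_{\exp(-\beta R)}(r)\bigr] + \mathcal{K}\bigl[\rho,\pi_{\exp(-\lambda r)}\bigr]
+ \log\Bigl\{\pi_{\exp(-\lambda r)}\Bigl[\exp\bigl\{N\log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]\rho(m')\bigr\}\Bigr]\Bigr\} - \log(\epsilon)
\]
\[
= \mathcal{K}\bigl[\rho,\pi_{\exp(-\gamma r)}\bigr]
+ \log\Bigl\{\pi_{\exp(-\gamma r)}\Bigl[\exp\bigl\{(\gamma-\lambda)r + N\log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]\rho(m')\bigr\}\Bigr]\Bigr\}
- (\gamma-\lambda)\pi_{\exp(-\beta R)}(r) - \log(\epsilon).
\]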

We would like to have a fully empirical upper bound even in the case when $\lambda \neq \gamma$.
This can be done by using the theorem twice. We will need a lemma.

Lemma 2.1.2 For any probability distribution $\pi\in\mathcal{M}_+^1(\Theta)$, for any bounded
measurable functions $g, h : \Theta \to \mathbb{R}$,

\[
\pi_{\exp(-g)}(g) - \pi_{\exp(-h)}(g) \leq \pi_{\exp(-g)}(h) - \pi_{\exp(-h)}(h).
\]

Proof. Let us notice that

\[
0 \leq \mathcal{K}\bigl(\pi_{\exp(-g)},\pi_{\exp(-h)}\bigr)
= \pi_{\exp(-g)}(h) + \log\bigl\{\pi[\exp(-h)]\bigr\} + \mathcal{K}\bigl(\pi_{\exp(-g)},\pi\bigr)
\]
\[
= \pi_{\exp(-g)}(h) - \pi_{\exp(-h)}(h) - \mathcal{K}\bigl(\pi_{\exp(-h)},\pi\bigr) + \mathcal{K}\bigl(\pi_{\exp(-g)},\pi\bigr)
\]
\[
= \pi_{\exp(-g)}(h) - \pi_{\exp(-h)}(h) - \mathcal{K}\bigl(\pi_{\exp(-h)},\pi\bigr) - \pi_{\exp(-g)}(g) - \log\bigl\{\pi[\exp(-g)]\bigr\}.
\]

Moreover

\[
-\log\bigl\{\pi[\exp(-g)]\bigr\} \leq \pi_{\exp(-h)}(g) + \mathcal{K}\bigl(\pi_{\exp(-h)},\pi\bigr),
\]

which ends the proof. □
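Lemma 2.1.2 lends itself to a quick numerical sanity check on a finite parameter set. The following sketch draws random priors and random functions $g$ and $h$ on a hypothetical six-point set and computes the gap between the two sides of the lemma; the helper names are ours and the experiment proves nothing, it only illustrates the statement.

```python
import math
import random

def gibbs(prior, h):
    """Gibbs measure pi_{exp(-h)} on a finite set."""
    w = [p * math.exp(-x) for p, x in zip(prior, h)]
    z = sum(w)
    return [x / z for x in w]

def expect(mu, f):
    return sum(m * x for m, x in zip(mu, f))

def lemma_gap(prior, g, h):
    """Right-hand side minus left-hand side of the lemma (should be >= 0)."""
    pg, ph = gibbs(prior, g), gibbs(prior, h)
    lhs = expect(pg, g) - expect(ph, g)
    rhs = expect(pg, h) - expect(ph, h)
    return rhs - lhs

random.seed(1)
gaps = []
for _ in range(1000):
    raw = [random.random() + 1e-3 for _ in range(6)]
    prior = [x / sum(raw) for x in raw]
    g = [random.uniform(-3, 3) for _ in range(6)]
    h = [random.uniform(-3, 3) for _ in range(6)]
    gaps.append(lemma_gap(prior, g, h))
min_gap = min(gaps)
print(min_gap)
```

The gap is equal to the symmetrized divergence $\mathcal{K}(\pi_{\exp(-g)},\pi_{\exp(-h)}) + \mathcal{K}(\pi_{\exp(-h)},\pi_{\exp(-g)})$, which explains why it is never negative.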

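For any positive real constants $\beta$ and $\lambda$, we can then apply Theorem 2.1.1 to
$\rho = \pi_{\exp(-\lambda r)}$, and use the inequality

\[
\frac{\lambda}{\beta}\bigl[\pi_{\exp(-\lambda r)}(r) - \pi_{\exp(-\beta R)}(r)\bigr]
\leq \pi_{\exp(-\lambda r)}(R) - \pi_{\exp(-\beta R)}(R) \tag{2.5}
\]

provided by the previous lemma. We thus obtain with $\mathbb{P}$ probability at least $1-\epsilon$

\[
-N\log\Bigl\{1-\tanh\bigl(\tfrac{\gamma}{N}\bigr)\frac{\lambda}{\beta}\bigl[\pi_{\exp(-\lambda r)}(r)-\pi_{\exp(-\beta R)}(r)\bigr]\Bigr\}
- \gamma\bigl[\pi_{\exp(-\lambda r)}(r)-\pi_{\exp(-\beta R)}(r)\bigr]
\]
\[
\leq \log\Bigl\{\pi_{\exp(-\lambda r)}\Bigl[\exp\bigl\{N\log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]\pi_{\exp(-\lambda r)}(m')\bigr\}\Bigr]\Bigr\} - \log(\epsilon).
\]

Let us introduce the convex function

\[
F_{\gamma,\alpha}(x) = -N\log\bigl[1-\tanh\bigl(\tfrac{\gamma}{N}\bigr)x\bigr] - \alpha x \geq \bigl[N\tanh\bigl(\tfrac{\gamma}{N}\bigr)-\alpha\bigr]x.
\]

With $\mathbb{P}$ probability at least $1-\epsilon$,

\[
-\pi_{\exp(-\beta R)}(r) \leq \inf_{\lambda\in\mathbb{R}_+^*}\Bigl\{-\pi_{\exp(-\lambda r)}(r)
+ \frac{\beta}{\lambda}F_{\gamma,\frac{\beta\gamma}{\lambda}}^{-1}\Bigl[
\log\Bigl\{\pi_{\exp(-\lambda r)}\Bigl[\exp\bigl\{N\log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]\pi_{\exp(-\lambda r)}(m')\bigr\}\Bigr]\Bigr\} - \log(\epsilon)\Bigr]\Bigr\}.
\]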
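Since Theorem 2.1.1 holds uniformly for any posterior distribution $\rho$, we can apply
it again to some arbitrary posterior distribution $\rho$. We can moreover make the result
uniform in $\beta$ and $\gamma$ by considering some atomic measure $\nu\in\mathcal{M}_+^1(\mathbb{R})$ on the real
line and using a union bound. This leads to

Theorem 2.1.3. For any atomic probability distribution on the positive real line
$\nu\in\mathcal{M}_+^1(\mathbb{R}_+)$, with $\mathbb{P}$ probability at least $1-\epsilon$, for any posterior distribution
$\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, for any positive real constants $\beta$ and $\gamma$,

\[
\bigl[N\tanh\bigl(\tfrac{\gamma}{N}\bigr)-\beta\bigr]\bigl[\rho(R)-\pi_{\exp(-\beta R)}(R)\bigr]
\leq F_{\gamma,\beta}\bigl[\rho(R)-\pi_{\exp(-\beta R)}(R)\bigr] \leq B(\rho,\beta,\gamma),
\]

where

\[
B(\rho,\beta,\gamma) = \inf_{\substack{\lambda_1\in\mathbb{R}_+,\ \lambda_1\leq\gamma \\ \lambda_2\in\mathbb{R},\ \lambda_2>\frac{\beta\gamma}{N}\tanh(\frac{\gamma}{N})^{-1}}}
\Bigl\{\mathcal{K}\bigl[\rho,\pi_{\exp(-\lambda_1 r)}\bigr]
+ (\gamma-\lambda_1)\bigl[\rho(r)-\pi_{\exp(-\lambda_2 r)}(r)\bigr]
\]
\[
+ \log\Bigl\{\pi_{\exp(-\lambda_1 r)}\Bigl[\exp\bigl\{N\log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]\rho(m')\bigr\}\Bigr]\Bigr\}
- \log\bigl[\epsilon\,\nu(\beta)\nu(\gamma)\bigr]
\]
\[
+ (\gamma-\lambda_1)\frac{\beta}{\lambda_2}F_{\gamma,\frac{\beta\gamma}{\lambda_2}}^{-1}\Bigl[
\log\Bigl\{\pi_{\exp(-\lambda_2 r)}\Bigl[\exp\bigl\{N\log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]\pi_{\exp(-\lambda_2 r)}(m')\bigr\}\Bigr]\Bigr\}
- \log\bigl[\epsilon\,\nu(\beta)\nu(\gamma)\bigr]\Bigr]\Bigr\}
\]
\[
\leq \inf_{\substack{\lambda_1\in\mathbb{R}_+,\ \lambda_1\leq\gamma \\ \lambda_2\in\mathbb{R},\ \lambda_2>\frac{\beta\gamma}{N}\tanh(\frac{\gamma}{N})^{-1}}}
\Bigl\{\mathcal{K}\bigl[\rho,\pi_{\exp(-\lambda_1 r)}\bigr]
+ (\gamma-\lambda_1)\bigl[\rho(r)-\pi_{\exp(-\lambda_2 r)}(r)\bigr]
\]
\[
+ \log\Bigl\{\pi_{\exp(-\lambda_1 r)}\Bigl[\exp\bigl\{N\log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]\rho(m')\bigr\}\Bigr]\Bigr\}
- \log\bigl[\epsilon\,\nu(\beta)\nu(\gamma)\bigr]
\]
\[
+ \frac{\beta\bigl(1-\frac{\lambda_1}{\gamma}\bigr)}{\frac{N}{\gamma}\lambda_2\tanh(\frac{\gamma}{N})-\beta}
\Bigl[\log\Bigl\{\pi_{\exp(-\lambda_2 r)}\Bigl[\exp\bigl\{N\log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]\pi_{\exp(-\lambda_2 r)}(m')\bigr\}\Bigr]\Bigr\}
- \log\bigl[\epsilon\,\nu(\beta)\nu(\gamma)\bigr]\Bigr]\Bigr\},
\]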

where we have written for short $\nu(\beta)$ and $\nu(\gamma)$ instead of $\nu(\{\beta\})$ and $\nu(\{\gamma\})$.

Let us notice that $B(\rho,\beta,\gamma) = +\infty$ when $\nu(\beta) = 0$ or $\nu(\gamma) = 0$; the uniformity
in $\beta$ and $\gamma$ of the theorem therefore necessarily bears on a countable number of
values of these parameters. We can typically choose distributions for $\nu$ such as the
one used in Theorem 1.2.8 (page 13): namely we can put for some positive real ratio
$\alpha > 1$

\[
\nu(\alpha^k) = \frac{1}{(k+1)(k+2)}, \qquad k\in\mathbb{N},
\]

or alternatively, since we are interested in values of the parameters less than $N$, we
can prefer

\[
\nu(\alpha^k) = \frac{\log(\alpha)}{\log(\alpha N)}, \qquad 0 \leq k < \frac{\log(N)}{\log(\alpha)}.
\]

We can also use such a coding distribution on dyadic numbers as the one defined
by equation (1.7, page 15).
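To make the two proposed choices of the atomic measure $\nu$ concrete, here is a minimal sketch in Python listing the atoms $\alpha^k$ with their weights, and the price $-\log\nu(\beta)$ paid in the bound for each atom. The function names and the values $\alpha = 2$, $N = 1000$ are illustrative assumptions.

```python
import math

def weights_unbounded(k_max):
    """nu(alpha^k) = 1/((k+1)(k+2)), listed for k = 0, ..., k_max - 1 only."""
    return [1.0 / ((k + 1) * (k + 2)) for k in range(k_max)]

def weights_bounded(alpha, N):
    """nu(alpha^k) = log(alpha)/log(alpha N) for 0 <= k < log(N)/log(alpha).
    Returns the list of pairs (atom, weight)."""
    k_count = math.ceil(math.log(N) / math.log(alpha))
    w = math.log(alpha) / math.log(alpha * N)
    return [(alpha ** k, w) for k in range(k_count)]

alpha, N = 2.0, 1000                      # hypothetical ratio and sample size
atoms = weights_bounded(alpha, N)
total_mass = sum(w for _, w in atoms)     # at most 1
partial_sum = sum(weights_unbounded(1000))  # telescopes to 1 - 1/1001
penalty = -math.log(atoms[0][1])          # cost -log nu(beta), same for each atom
print(len(atoms), total_mass, partial_sum, penalty)
```

With the second choice every atom has the same weight, so that the term $-\log[\nu(\beta)\nu(\gamma)]$ grows only like $2\log\log(\alpha N)$, whereas with the first choice it depends on the rank $k$ of the atom.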

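Following the same route as for Theorem 1.3.15 (page 31), we can also prove the
following result about the deviations under any posterior distribution $\rho$:

Theorem 2.1.4 For any $\epsilon\in\,)0,1($, with $\mathbb{P}$ probability at least $1-\epsilon$, for any posterior
distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, with $\rho$ probability at least $1-\xi$,

\[
F_{\gamma,\beta}\bigl[R(\hat\theta)-\pi_{\exp(-\beta R)}(R)\bigr]
\leq \inf_{\substack{\lambda_1\in\mathbb{R}_+,\ \lambda_1\leq\gamma \\ \lambda_2\in\mathbb{R},\ \lambda_2>\frac{\beta\gamma}{N}\tanh(\frac{\gamma}{N})^{-1}}}
\Bigl\{\log\Bigl[\frac{d\rho}{d\pi_{\exp(-\lambda_1 r)}}(\hat\theta)\Bigr]
+ (\gamma-\lambda_1)\bigl[r(\hat\theta)-\pi_{\exp(-\lambda_2 r)}(r)\bigr]
\]
\[
+ \log\Bigl\{\pi_{\exp(-\lambda_1 r)}\Bigl[\exp\bigl\{N\log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]m'(\cdot,\hat\theta)\bigr\}\Bigr]\Bigr\}
- \log\bigl[\epsilon\,\xi\,\nu(\beta)\nu(\gamma)\bigr]
\]
\[
+ (\gamma-\lambda_1)\frac{\beta}{\lambda_2}F_{\gamma,\frac{\beta\gamma}{\lambda_2}}^{-1}\Bigl[
\log\Bigl\{\pi_{\exp(-\lambda_2 r)}\Bigl[\exp\bigl\{N\log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]\pi_{\exp(-\lambda_2 r)}(m')\bigr\}\Bigr]\Bigr\}
- \log\bigl[\epsilon\,\nu(\beta)\nu(\gamma)\bigr]\Bigr]\Bigr\}.
\]

The only tricky point is to justify that we can still take an infimum in $\lambda_1$ without
using a union bound. To justify this, we have to notice that the following variant of
Theorem 2.1.1 (page 52) holds: with $\mathbb{P}$ probability at least $1-\epsilon$, for any posterior
distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, for any real constant $\lambda$,

\[
\rho\bigl\{F_{\gamma,\beta}\bigl[R-\pi_{\exp(-\beta R)}(R)\bigr]\bigr\}
\leq \mathcal{K}\bigl[\rho,\pi_{\exp(-\gamma r)}\bigr]
+ \rho\Bigl\{\inf_{\lambda\in\mathbb{R}}\Bigl[\log\Bigl\{\pi_{\exp(-\gamma r)}\Bigl[\exp\bigl\{(\gamma-\lambda)r
+ N\log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]m'(\cdot,\hat\theta)\bigr\}\Bigr]\Bigr\}
- (\gamma-\lambda)\pi_{\exp(-\beta R)}(r)\Bigr]\Bigr\} - \log(\epsilon).
\]

We leave the details as an exercise.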



2.1.2. The effective temperature of a posterior distribution 

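Using the parametric approximation $\pi_{\exp(-\alpha r)}(r) - \inf_\Theta r \simeq \frac{d_e}{\alpha}$, we get as an order
of magnitude

\[
B\bigl(\pi_{\exp(-\lambda_1 r)},\beta,\gamma\bigr) \lesssim -(\gamma-\lambda_1)\,d_e\bigl[\lambda_2^{-1}-\lambda_1^{-1}\bigr]
+ 2d_e\log\Bigl(\frac{\lambda_1}{\lambda_1-N\log[\cosh(\frac{\gamma}{N})]\,x}\Bigr)
+ 2N\log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]\overline{\varphi}(x)
\]
\[
+ \frac{\bigl(1-\frac{\lambda_1}{\gamma}\bigr)\beta}{\lambda_2\bigl[\frac{N}{\gamma}\tanh(\frac{\gamma}{N})-\frac{\beta}{\lambda_2}\bigr]}
\Bigl[2d_e\log\Bigl(\frac{\lambda_2}{\lambda_2-N\log[\cosh(\frac{\gamma}{N})]\,x}\Bigr)
+ 2N\log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]\overline{\varphi}(x)\Bigr]
\]
\[
- \Bigl\{1+\frac{\bigl(1-\frac{\lambda_1}{\gamma}\bigr)\beta}{\lambda_2\bigl[\frac{N}{\gamma}\tanh(\frac{\gamma}{N})-\frac{\beta}{\lambda_2}\bigr]}\Bigr\}
\log\bigl[\nu(\beta)\nu(\gamma)\epsilon\bigr].
\]

Therefore, if the empirical dimension $d_e$ stays bounded when $N$ increases, we are
going to obtain a negative upper bound for any values of the constants $\lambda_1 > \lambda_2 > 0$,
as soon as $\gamma$ and $\frac{N}{\gamma}$ are chosen to be large enough. This ability to obtain negative
values for the bound $B(\pi_{\exp(-\lambda_1 r)},\beta,\gamma)$, and more generally $B(\rho,\beta,\gamma)$, leads the
way to introducing the new concept of the effective temperature of an estimator.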

Definition 2.1.1 For any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$ we define the
effective temperature $T(\rho)\in\mathbb{R}\cup\{-\infty,+\infty\}$ of $\rho$ by the equation

\[
\rho(R) = \pi_{\exp(-\frac{R}{T(\rho)})}(R).
\]

Note that $\beta \mapsto \pi_{\exp(-\beta R)}(R) : \mathbb{R}\cup\{-\infty,+\infty\} \to (0,1)$ is continuous and strictly
decreasing from $\operatorname{ess\,sup}_\pi R$ to $\operatorname{ess\,inf}_\pi R$ (as soon as these two bounds do not
coincide). This shows that the effective temperature $T(\rho)$ is a well-defined random
variable.

Theorem 2.1.3 provides a bound for $T(\rho)$, indeed:

Proposition 2.1.5. Let

\[
\hat\beta(\rho) = \sup\Bigl\{\beta\in\mathbb{R};\ \inf_{\gamma,\,N\tanh(\frac{\gamma}{N})>\beta} B(\rho,\beta,\gamma) \leq 0\Bigr\},
\]

where $B(\rho,\beta,\gamma)$ is as in Theorem 2.1.3 (page 54). Then with $\mathbb{P}$ probability at least
$1-\epsilon$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, $T(\rho) \leq \hat\beta(\rho)^{-1}$, or equivalently
$\rho(R) \leq \pi_{\exp[-\hat\beta(\rho)R]}(R)$.

This notion of effective temperature of a (randomized) estimator p is interesting 
for two reasons: 

• the difference $\rho(R) - \pi_{\exp(-\beta R)}(R)$ can be estimated with better accuracy
than $\rho(R)$ itself, due to the use of relative deviation inequalities, leading to
convergence rates up to $1/N$ in favourable situations, even when $\inf_\Theta R$ is not
close to zero;

• and of course $\pi_{\exp(-\beta R)}(R)$ is a decreasing function of $\beta$, thus being able to
estimate $\rho(R) - \pi_{\exp(-\beta R)}(R)$ with some given accuracy means being able
to discriminate between values of $\rho(R)$ with the same accuracy, although
doing so through the parametrization $\beta \mapsto \pi_{\exp(-\beta R)}(R)$, which can neither
be observed nor estimated with the same precision!
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2.1.3. Analysis of an empirical bound for the effective temperature 



We are now going to launch into a mathematically rigorous analysis of the bound
$B(\pi_{\exp(-\lambda_1 r)}, \beta, \gamma)$ provided by Theorem 2.1.3 (page 54), to show that
$\inf_{\rho \in \mathcal{M}_+^1(\Theta)} \pi_{\exp[-\hat{\beta}(\rho) R]}(R)$ converges indeed to $\inf_{\Theta} R$ at some optimal rate in favourable
situations.

It is more convenient for this purpose to use deviation inequalities involving $M'$
rather than $m'$. It is straightforward to extend Theorem 1.4.2 (page 36) to

Theorem 2.1.6. For any real constants $\beta$ and $\gamma$, for any prior distributions $\pi, \mu \in \mathcal{M}_+^1(\Theta)$,
with $\mathbb{P}$ probability at least $1 - \eta$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\gamma \rho \otimes \pi_{\exp(-\beta R)}\bigl[\Psi_{\frac{\gamma}{N}}(R', M')\bigr] \le \gamma \rho \otimes \pi_{\exp(-\beta R)}(r') + \mathcal{K}(\rho, \mu) - \log(\eta).$$

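In order to transform the left-hand side into a linear expression and at the same
time localize this theorem, let us choose $\mu$ defined by its density

$$\frac{d\mu}{d\pi}(\theta_1) = C^{-1} \exp\biggl[ -\beta R(\theta_1) - \gamma \int \Bigl\{ \Psi_{\frac{\gamma}{N}}\bigl[R'(\theta_1,\theta_2), M'(\theta_1,\theta_2)\bigr] - \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N}) R'(\theta_1,\theta_2) \Bigr\} \pi_{\exp(-\beta R)}(d\theta_2) \biggr],$$

where $C$ is such that $\mu(\Theta) = 1$. We get

$$\begin{aligned}
\mathcal{K}(\rho, \mu) &= \beta \rho(R) + \gamma \rho \otimes \pi_{\exp(-\beta R)}\bigl[\Psi_{\frac{\gamma}{N}}(R', M') - \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N}) R'\bigr] + \mathcal{K}(\rho, \pi) \\
&\quad + \log\biggl\{ \int \exp\biggl[ -\beta R(\theta_1) - \gamma \int \Bigl\{ \Psi_{\frac{\gamma}{N}}\bigl[R'(\theta_1,\theta_2), M'(\theta_1,\theta_2)\bigr] - \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N}) R'(\theta_1,\theta_2) \Bigr\} \pi_{\exp(-\beta R)}(d\theta_2) \biggr] \pi(d\theta_1) \biggr\} \\
&= \beta \bigl[\rho(R) - \pi_{\exp(-\beta R)}(R)\bigr] + \gamma \rho \otimes \pi_{\exp(-\beta R)}\bigl[\Psi_{\frac{\gamma}{N}}(R', M') - \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N}) R'\bigr] \\
&\quad + \mathcal{K}(\rho, \pi) - \mathcal{K}(\pi_{\exp(-\beta R)}, \pi) \\
&\quad + \log\biggl\{ \int \exp\biggl[ -\gamma \int \Bigl\{ \Psi_{\frac{\gamma}{N}}\bigl[R'(\theta_1,\theta_2), M'(\theta_1,\theta_2)\bigr] - \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N}) R'(\theta_1,\theta_2) \Bigr\} \pi_{\exp(-\beta R)}(d\theta_2) \biggr] \pi_{\exp(-\beta R)}(d\theta_1) \biggr\}.
\end{aligned}$$

Thus with $\mathbb{P}$ probability at least $1 - \eta$,

$$(2.6) \qquad \bigl[N \sinh(\tfrac{\gamma}{N}) - \beta\bigr] \bigl[\rho(R) - \pi_{\exp(-\beta R)}(R)\bigr] \le \gamma \bigl[\rho(r) - \pi_{\exp(-\beta R)}(r)\bigr] + \mathcal{K}(\rho, \pi) - \mathcal{K}(\pi_{\exp(-\beta R)}, \pi) - \log(\eta) + C(\beta, \gamma),$$

where

$$C(\beta, \gamma) = \log\biggl\{ \int \exp\biggl[ -\gamma \int \Bigl\{ \Psi_{\frac{\gamma}{N}}\bigl[R'(\theta_1,\theta_2), M'(\theta_1,\theta_2)\bigr] - \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N}) R'(\theta_1,\theta_2) \Bigr\} \pi_{\exp(-\beta R)}(d\theta_2) \biggr] \pi_{\exp(-\beta R)}(d\theta_1) \biggr\}.$$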

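Remarking that

$$\mathcal{K}[\rho, \pi_{\exp(-\beta R)}] = \beta \bigl[\rho(R) - \pi_{\exp(-\beta R)}(R)\bigr] + \mathcal{K}(\rho, \pi) - \mathcal{K}(\pi_{\exp(-\beta R)}, \pi),$$

we deduce from the previous inequality

Theorem 2.1.7. For any real constants $\beta$ and $\gamma$, with $\mathbb{P}$ probability at least $1 - \eta$,
for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$N \sinh(\tfrac{\gamma}{N}) \bigl[\rho(R) - \pi_{\exp(-\beta R)}(R)\bigr] \le \gamma \bigl[\rho(r) - \pi_{\exp(-\beta R)}(r)\bigr] + \mathcal{K}[\rho, \pi_{\exp(-\beta R)}] - \log(\eta) + C(\beta, \gamma).$$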

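We can also go in a slightly different direction, starting again from equation
(2.6, page 57) and remarking that for any real constant $\lambda$,

$$\lambda \bigl[\rho(r) - \pi_{\exp(-\beta R)}(r)\bigr] + \mathcal{K}(\rho, \pi) - \mathcal{K}(\pi_{\exp(-\beta R)}, \pi) \le \lambda \rho(r) + \mathcal{K}(\rho, \pi) + \log\bigl\{\pi[\exp(-\lambda r)]\bigr\} = \mathcal{K}[\rho, \pi_{\exp(-\lambda r)}].$$

This leads to

Theorem 2.1.8. For any real constants $\beta$ and $\gamma$, with $\mathbb{P}$ probability at least $1 - \eta$,
for any real constant $\lambda$,

$$\bigl[N \sinh(\tfrac{\gamma}{N}) - \beta\bigr] \bigl[\rho(R) - \pi_{\exp(-\beta R)}(R)\bigr] \le (\gamma - \lambda) \bigl[\rho(r) - \pi_{\exp(-\beta R)}(r)\bigr] + \mathcal{K}[\rho, \pi_{\exp(-\lambda r)}] - \log(\eta) + C(\beta, \gamma),$$

where the definition of $C(\beta, \gamma)$ is given by equation (2.6, page 57).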

We can now use this inequality in the case when $\rho = \pi_{\exp(-\lambda r)}$ and combine it
with Inequality (2.5, page 53) to obtain

Theorem 2.1.9. For any real constants $\beta$ and $\gamma$, with $\mathbb{P}$ probability at least $1 - \eta$,
for any real constant $\lambda$,

$$\Bigl[\tfrac{N \lambda}{\beta} \sinh(\tfrac{\gamma}{N}) - \gamma\Bigr] \bigl[\pi_{\exp(-\lambda r)}(r) - \pi_{\exp(-\beta R)}(r)\bigr] \le C(\beta, \gamma) - \log(\eta).$$

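We deduce from this theorem

Proposition 2.1.10. For any real positive constants $\beta_1$, $\beta_2$ and $\gamma$, with $\mathbb{P}$ probability
at least $1 - \eta$, for any real constants $\lambda_1$ and $\lambda_2$, such that $\lambda_2 < \beta_2 \frac{\gamma}{N} \sinh(\frac{\gamma}{N})^{-1}$
and $\lambda_1 > \beta_1 \frac{\gamma}{N} \sinh(\frac{\gamma}{N})^{-1}$,

$$\pi_{\exp(-\lambda_1 r)}(r) - \pi_{\exp(-\lambda_2 r)}(r) \le \pi_{\exp(-\beta_1 R)}(r) - \pi_{\exp(-\beta_2 R)}(r) + \frac{C(\beta_1, \gamma) + \log(2/\eta)}{\frac{N \lambda_1}{\beta_1} \sinh(\frac{\gamma}{N}) - \gamma} + \frac{C(\beta_2, \gamma) + \log(2/\eta)}{\gamma - \frac{N \lambda_2}{\beta_2} \sinh(\frac{\gamma}{N})}.$$

Moreover, $\pi_{\exp(-\beta_1 R)}$ and $\pi_{\exp(-\beta_2 R)}$ being prior distributions, with $\mathbb{P}$ probability
at least $1 - \eta$,

$$\gamma \bigl[\pi_{\exp(-\beta_1 R)}(r) - \pi_{\exp(-\beta_2 R)}(r)\bigr] \le \gamma \pi_{\exp(-\beta_1 R)} \otimes \pi_{\exp(-\beta_2 R)}\bigl[\Psi_{-\frac{\gamma}{N}}(R', M')\bigr] - \log(\eta).$$

Hence

Proposition 2.1.11. For any positive real constants $\beta_1$, $\beta_2$ and $\gamma$, with $\mathbb{P}$ probability
at least $1 - \eta$, for any positive real constants $\lambda_1$ and $\lambda_2$ such that
$\lambda_2 < \beta_2 \frac{\gamma}{N} \sinh(\frac{\gamma}{N})^{-1}$ and $\lambda_1 > \beta_1 \frac{\gamma}{N} \sinh(\frac{\gamma}{N})^{-1}$,

$$\pi_{\exp(-\lambda_1 r)}(r) - \pi_{\exp(-\lambda_2 r)}(r) \le \pi_{\exp(-\beta_1 R)} \otimes \pi_{\exp(-\beta_2 R)}\bigl[\Psi_{-\frac{\gamma}{N}}(R', M')\bigr] + \frac{\log(\frac{3}{\eta})}{\gamma} + \frac{C(\beta_1, \gamma) + \log(\frac{3}{\eta})}{\frac{N \lambda_1}{\beta_1} \sinh(\frac{\gamma}{N}) - \gamma} + \frac{C(\beta_2, \gamma) + \log(\frac{3}{\eta})}{\gamma - \frac{N \lambda_2}{\beta_2} \sinh(\frac{\gamma}{N})}.$$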

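In order to complete the analysis of the bound $B(\pi_{\exp(-\lambda_1 r)}, \beta, \gamma)$ given by
Theorem 2.1.3 (page 54), it now remains to bound quantities of the general form

$$\log\Bigl\{ \pi_{\exp(-\lambda r)}\Bigl[ \exp\bigl\{ N \log[\cosh(\tfrac{\gamma}{N})] \pi_{\exp(-\lambda r)}(m') \bigr\} \Bigr] \Bigr\} = \sup_{\rho \in \mathcal{M}_+^1(\Theta)} N \log[\cosh(\tfrac{\gamma}{N})] \rho \otimes \pi_{\exp(-\lambda r)}(m') - \mathcal{K}[\rho, \pi_{\exp(-\lambda r)}].$$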

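Let us consider the prior distribution $\mu \in \mathcal{M}_+^1(\Theta \times \Theta)$ on couples of parameters
defined by the density

$$\frac{d\mu}{d(\pi \otimes \pi)}(\theta_1, \theta_2) = C^{-1} \exp\Bigl\{ -\beta R(\theta_1) - \beta R(\theta_2) + \alpha \Phi_{-\frac{\alpha}{N}}\bigl[M'(\theta_1, \theta_2)\bigr] \Bigr\},$$

where the normalizing constant $C$ is such that $\mu(\Theta \times \Theta) = 1$. Since for fixed values of
the parameters $\theta$ and $\theta' \in \Theta$, $m'(\theta, \theta')$, like $r(\theta)$, is a sum of independent Bernoulli
random variables, we can easily adapt the proof of Theorem 1.1.4 on page 4 to
establish that with $\mathbb{P}$ probability at least $1 - \eta$, for any posterior distribution $\rho$ and
any real constant $\lambda$,

$$\begin{aligned}
\alpha \rho \otimes \pi_{\exp(-\lambda r)}(m') &\le \alpha \rho \otimes \pi_{\exp(-\lambda r)}\bigl[\Phi_{-\frac{\alpha}{N}}(M')\bigr] + \mathcal{K}(\rho \otimes \pi_{\exp(-\lambda r)}, \mu) - \log(\eta) \\
&= \mathcal{K}[\rho, \pi_{\exp(-\beta R)}] + \mathcal{K}[\pi_{\exp(-\lambda r)}, \pi_{\exp(-\beta R)}] \\
&\quad + \log\Bigl\{ \pi_{\exp(-\beta R)} \otimes \pi_{\exp(-\beta R)}\bigl[\exp(\alpha \Phi_{-\frac{\alpha}{N}} \circ M')\bigr] \Bigr\} - \log(\eta).
\end{aligned}$$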



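Thus for any real constant $\beta$ and any positive real constants $\alpha$ and $\gamma$, with
$\mathbb{P}$ probability at least $1 - \eta$, for any real constant $\lambda$,

$$(2.7) \qquad \begin{aligned}
&\log\Bigl\{ \pi_{\exp(-\lambda r)}\Bigl[ \exp\bigl\{ N \log[\cosh(\tfrac{\gamma}{N})] \pi_{\exp(-\lambda r)}(m') \bigr\} \Bigr] \Bigr\} \\
&\quad \le \sup_{\rho \in \mathcal{M}_+^1(\Theta)} \biggl( \tfrac{N}{\alpha} \log[\cosh(\tfrac{\gamma}{N})] \Bigl\{ \mathcal{K}[\rho, \pi_{\exp(-\beta R)}] + \mathcal{K}[\pi_{\exp(-\lambda r)}, \pi_{\exp(-\beta R)}] \\
&\qquad\qquad + \log\bigl\{ \pi_{\exp(-\beta R)} \otimes \pi_{\exp(-\beta R)}\bigl[\exp(\alpha \Phi_{-\frac{\alpha}{N}} \circ M')\bigr] \bigr\} - \log(\eta) \Bigr\} - \mathcal{K}[\rho, \pi_{\exp(-\lambda r)}] \biggr).
\end{aligned}$$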



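To finish, we need some appropriate upper bound for the entropy
$\mathcal{K}[\rho, \pi_{\exp(-\beta R)}]$. This question can be handled in the following way: using
Theorem 2.1.7 (page 58), we see that for any positive real constants $\gamma$ and $\beta$, with $\mathbb{P}$
probability at least $1 - \eta$, for any posterior distribution $\rho$,

$$\begin{aligned}
\mathcal{K}[\rho, \pi_{\exp(-\beta R)}] &= \beta \bigl[\rho(R) - \pi_{\exp(-\beta R)}(R)\bigr] + \mathcal{K}(\rho, \pi) - \mathcal{K}(\pi_{\exp(-\beta R)}, \pi) \\
&\le \frac{\beta}{N \sinh(\frac{\gamma}{N})} \Bigl\{ \gamma \bigl[\rho(r) - \pi_{\exp(-\beta R)}(r)\bigr] + \mathcal{K}[\rho, \pi_{\exp(-\beta R)}] - \log(\eta) + C(\beta, \gamma) \Bigr\} \\
&\quad + \mathcal{K}(\rho, \pi) - \mathcal{K}(\pi_{\exp(-\beta R)}, \pi) \\
&\le \mathcal{K}\bigl[\rho, \pi_{\exp[-\beta \frac{\gamma}{N} \sinh(\frac{\gamma}{N})^{-1} r]}\bigr] + \frac{\beta}{N \sinh(\frac{\gamma}{N})} \Bigl\{ \mathcal{K}[\rho, \pi_{\exp(-\beta R)}] + C(\beta, \gamma) - \log(\eta) \Bigr\}.
\end{aligned}$$

In other words,

Theorem 2.1.12. For any positive real constants $\beta$ and $\gamma$ such that $\beta < N \sinh(\frac{\gamma}{N})$,
with $\mathbb{P}$ probability at least $1 - \eta$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\mathcal{K}[\rho, \pi_{\exp(-\beta R)}] \le \frac{\mathcal{K}\bigl[\rho, \pi_{\exp[-\beta \frac{\gamma}{N} \sinh(\frac{\gamma}{N})^{-1} r]}\bigr]}{1 - \frac{\beta}{N \sinh(\frac{\gamma}{N})}} + \frac{C(\beta, \gamma) - \log(\eta)}{\frac{N \sinh(\frac{\gamma}{N})}{\beta} - 1},$$

where the quantity $C(\beta, \gamma)$ is defined by equation (2.6, page 57). Equivalently, it will
be in some cases more convenient to use this result in the form: for any positive real
constants $\lambda$ and $\gamma$, with $\mathbb{P}$ probability at least $1 - \eta$, for any posterior distribution
$\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\mathcal{K}\bigl[\rho, \pi_{\exp[-\lambda \frac{N}{\gamma} \sinh(\frac{\gamma}{N}) R]}\bigr] \le \frac{\mathcal{K}[\rho, \pi_{\exp(-\lambda r)}]}{1 - \frac{\lambda}{\gamma}} + \frac{C\bigl(\lambda \frac{N}{\gamma} \sinh(\frac{\gamma}{N}), \gamma\bigr) - \log(\eta)}{\frac{\gamma}{\lambda} - 1}.$$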



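Choosing in equation (2.7, page 59) $\alpha = \frac{N \log[\cosh(\frac{\gamma}{N})]}{1 - \frac{\lambda}{\gamma}}$ and $\beta = \lambda \frac{N}{\gamma} \sinh(\frac{\gamma}{N})$,
so that $\frac{N}{\alpha} \log[\cosh(\frac{\gamma}{N})] = 1 - \frac{\lambda}{\gamma}$, we obtain with $\mathbb{P}$ probability at least $1 - \eta$,

$$\begin{aligned}
\log\Bigl\{ \pi_{\exp(-\lambda r)}\Bigl[ \exp\bigl\{ N \log[\cosh(\tfrac{\gamma}{N})] \pi_{\exp(-\lambda r)}(m') \bigr\} \Bigr] \Bigr\}
&\le \frac{2 \lambda}{\gamma} \bigl[ C(\beta, \gamma) + \log(\tfrac{2}{\eta}) \bigr] \\
&\quad + \bigl(1 - \tfrac{\lambda}{\gamma}\bigr) \log\Bigl\{ \pi_{\exp(-\beta R)} \otimes \pi_{\exp(-\beta R)}\bigl[\exp(\alpha \Phi_{-\frac{\alpha}{N}} \circ M')\bigr] \Bigr\} \\
&\quad + \bigl(1 - \tfrac{\lambda}{\gamma}\bigr) \log(\tfrac{2}{\eta}).
\end{aligned}$$

This proves

Proposition 2.1.13. For any positive real constants $\lambda < \gamma$, with $\mathbb{P}$ probability at
least $1 - \eta$,

$$\begin{aligned}
\log\Bigl\{ \pi_{\exp(-\lambda r)}\Bigl[ \exp\bigl\{ N \log[\cosh(\tfrac{\gamma}{N})] \pi_{\exp(-\lambda r)}(m') \bigr\} \Bigr] \Bigr\}
&\le \frac{2 \lambda}{\gamma} \Bigl[ C\bigl(\tfrac{N \lambda}{\gamma} \sinh(\tfrac{\gamma}{N}), \gamma\bigr) + \log(\tfrac{2}{\eta}) \Bigr] \\
&\quad + \bigl(1 - \tfrac{\lambda}{\gamma}\bigr) \log\biggl\{ \pi^{\otimes 2}_{\exp[-\frac{N \lambda}{\gamma} \sinh(\frac{\gamma}{N}) R]}\biggl[ \exp\biggl( \frac{N \log[\cosh(\frac{\gamma}{N})]}{1 - \frac{\lambda}{\gamma}} \, \Phi_{-\frac{\log[\cosh(\frac{\gamma}{N})]}{1 - \frac{\lambda}{\gamma}}} \circ M' \biggr) \biggr] \biggr\} \\
&\quad + \bigl(1 - \tfrac{\lambda}{\gamma}\bigr) \log(\tfrac{2}{\eta}).
\end{aligned}$$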



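We are now ready to analyse the bound $B(\pi_{\exp(-\lambda_1 r)}, \beta, \gamma)$ of Theorem 2.1.3
(page 54).

Theorem 2.1.14. For any positive real constants $\lambda_1$, $\lambda_2$, $\beta_1$, $\beta_2$, $\beta$ and $\gamma$, such
that

$$\begin{aligned}
\lambda_1 &< \gamma, & \beta_1 &< \lambda_1 \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N}), \\
\lambda_2 &< \gamma, & \beta_2 &> \lambda_2 \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N}), \\
& & \beta &< \lambda_2 \tfrac{N}{\gamma} \tanh(\tfrac{\gamma}{N}),
\end{aligned}$$

with $\mathbb{P}$ probability at least $1 - \eta$, the bound $B(\pi_{\exp(-\lambda_1 r)}, \beta, \gamma)$ of Theorem 2.1.3 (page 54)
satisfies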



S(7r cxp( _ AlI .),/3,7) 



< (7~ AiK 7r exp( _ ftfl) ®7r eS p(_ / j aJl )[*_i(ii',Af / )] + 



C(/3 1; 7) + log(p C(/3 2 ,7) + log(p 
^sinh(^)- 7 + 7-^sinh^) 



2Ai 

7 



C(^sinh(^), 7 )+log(p 



+ i 1 t) io 4- 



oxp[-^Lsinh(^.)fl] 

exp 



jVlog[cosh(-X)] 



$ l= B [co.h(i)] OM' 



+ (l-^jlog(l)-log[K{/8})K{7}> 



2A 



+ (7 - AOfe-L ^ sinh^), 7) + Iog(Z; 
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exp 



iVlog[cosh(^)] 

lo g [oo B h(i)] °M 



A2 

7 



l-^)log(2)-tog[i/({/?}M{7})e] 



where the function $C(\beta, \gamma)$ is defined by equation (2.6, page 57).



2.1.4. Adaptation to parametric and margin assumptions

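To help understand the previous theorem, it may be useful to give linear upper
bounds to the factors appearing in the right-hand side of the previous inequality.
Introducing $\tilde{\theta}$ such that $R(\tilde{\theta}) = \inf_{\Theta} R$ (assuming that such a parameter exists)
and remembering that

$$\begin{aligned}
\Psi_{-a}(p, m) &\le a^{-1} \sinh(a) p + 2 a^{-1} \sinh(\tfrac{a}{2})^2 m, & a &\in \mathbb{R}_+, \\
\Phi_{-a}(p) &\le a^{-1} [\exp(a) - 1] p, & a &\in \mathbb{R}_+, \\
\Psi_{a}(p, m) &\ge a^{-1} \sinh(a) p - 2 a^{-1} \sinh(\tfrac{a}{2})^2 m, & a &\in \mathbb{R}_+, \\
M'(\theta_1, \theta_2) &\le M'(\theta_1, \tilde{\theta}) + M'(\theta_2, \tilde{\theta}), & \theta_1, \theta_2 &\in \Theta, \\
M'(\theta_1, \tilde{\theta}) &\le x R'(\theta_1, \tilde{\theta}) + \varphi(x), & x &\in \mathbb{R}_+, \theta_1 \in \Theta,
\end{aligned}$$

the last inequality being rather a consequence of the definition of $\varphi$ than a property
of $M'$, we easily see that

$$\begin{aligned}
\pi_{\exp(-\beta_1 R)} \otimes \pi_{\exp(-\beta_2 R)}\bigl[\Psi_{-\frac{\gamma}{N}}(R', M')\bigr]
&\le \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N}) \bigl[\pi_{\exp(-\beta_1 R)}(R) - \pi_{\exp(-\beta_2 R)}(R)\bigr] \\
&\quad + \tfrac{2N}{\gamma} \sinh(\tfrac{\gamma}{2N})^2 \pi_{\exp(-\beta_1 R)} \otimes \pi_{\exp(-\beta_2 R)}(M') \\
&\le \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N}) \bigl[\pi_{\exp(-\beta_1 R)}(R) - \pi_{\exp(-\beta_2 R)}(R)\bigr] \\
&\quad + \tfrac{2xN}{\gamma} \sinh(\tfrac{\gamma}{2N})^2 \Bigl\{ \pi_{\exp(-\beta_1 R)}\bigl[R'(\cdot, \tilde{\theta})\bigr] + \pi_{\exp(-\beta_2 R)}\bigl[R'(\cdot, \tilde{\theta})\bigr] \Bigr\} \\
&\quad + \tfrac{4N}{\gamma} \sinh(\tfrac{\gamma}{2N})^2 \varphi(x),
\end{aligned}$$

that

$$\begin{aligned}
C(\beta, \gamma) &\le \log\Bigl\{ \pi_{\exp(-\beta R)}\Bigl[ \exp\bigl\{ 2N \sinh(\tfrac{\gamma}{2N})^2 \pi_{\exp(-\beta R)}(M') \bigr\} \Bigr] \Bigr\} \\
&\le \log\Bigl\{ \pi_{\exp(-\beta R)}\Bigl[ \exp\bigl\{ 2N \sinh(\tfrac{\gamma}{2N})^2 M'(\cdot, \tilde{\theta}) \bigr\} \Bigr] \Bigr\} + 2N \sinh(\tfrac{\gamma}{2N})^2 \pi_{\exp(-\beta R)}\bigl[M'(\cdot, \tilde{\theta})\bigr] \\
&\le \log\Bigl\{ \pi_{\exp(-\beta R)}\Bigl[ \exp\bigl\{ 2xN \sinh(\tfrac{\gamma}{2N})^2 R'(\cdot, \tilde{\theta}) \bigr\} \Bigr] \Bigr\} + 2xN \sinh(\tfrac{\gamma}{2N})^2 \pi_{\exp(-\beta R)}\bigl[R'(\cdot, \tilde{\theta})\bigr] + 4N \sinh(\tfrac{\gamma}{2N})^2 \varphi(x) \\
&= \int_{\beta - 2xN \sinh(\frac{\gamma}{2N})^2}^{\beta} \pi_{\exp(-\alpha R)}\bigl[R'(\cdot, \tilde{\theta})\bigr] \, d\alpha + 2xN \sinh(\tfrac{\gamma}{2N})^2 \pi_{\exp(-\beta R)}\bigl[R'(\cdot, \tilde{\theta})\bigr] + 4N \sinh(\tfrac{\gamma}{2N})^2 \varphi(x) \\
&\le 4xN \sinh(\tfrac{\gamma}{2N})^2 \pi_{\exp[-(\beta - 2xN \sinh(\frac{\gamma}{2N})^2) R]}\bigl[R'(\cdot, \tilde{\theta})\bigr] + 4N \sinh(\tfrac{\gamma}{2N})^2 \varphi(x),
\end{aligned}$$

and that, for any positive real constant $a$,

$$\begin{aligned}
\log\Bigl\{ \pi^{\otimes 2}_{\exp(-\beta R)}\bigl[\exp(N a \Phi_{-a} \circ M')\bigr] \Bigr\}
&\le 2 \log\Bigl\{ \pi_{\exp(-\beta R)}\Bigl[ \exp\bigl( N [\exp(a) - 1] M'(\cdot, \tilde{\theta}) \bigr) \Bigr] \Bigr\} \\
&\le 2xN [\exp(a) - 1] \pi_{\exp[-(\beta - xN[\exp(a) - 1]) R]}\bigl[R'(\cdot, \tilde{\theta})\bigr] + 2N [\exp(a) - 1] \varphi(x).
\end{aligned}$$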



Let us push further the investigation under the parametric assumption that for
some positive real constant $d$

$$(2.8) \qquad \lim_{\beta \to +\infty} \beta \, \pi_{\exp(-\beta R)}\bigl[R'(\cdot, \tilde{\theta})\bigr] = d.$$

This assumption will for instance hold true with $d = \frac{n}{2}$ when $R : \Theta \to (0, 1)$ is a
smooth function defined on a compact subset $\Theta$ of $\mathbb{R}^n$ that reaches its minimum
value on a finite number of non-degenerate (i.e. with a positive definite Hessian)
interior points of $\Theta$, and $\pi$ is absolutely continuous with respect to the Lebesgue
measure on $\Theta$ and has a smooth density.
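
The following numerical check is only a toy illustration of assumption (2.8), under choices that are not in the original text: $\Theta = [-1, 1]$ (so $n = 1$), a uniform prior, and the hypothetical risk $R(\theta) = \theta^2$, whose minimum $\inf_\Theta R = 0$ is reached at the interior point $\tilde{\theta} = 0$. The Gibbs expectation is approximated by a Riemann sum, and the product $\beta \, \pi_{\exp(-\beta R)}[R'(\cdot, \tilde{\theta})]$ should approach $d = n/2 = 1/2$ for large $\beta$.

```python
import math

def beta_times_excess_risk(beta, n_grid=20001):
    """Approximate beta * pi_{exp(-beta R)}[R - inf R] for R(theta) = theta**2
    on Theta = [-1, 1] with a uniform prior, using a Riemann sum."""
    step = 2.0 / (n_grid - 1)
    num = 0.0
    den = 0.0
    for i in range(n_grid):
        theta = -1.0 + i * step
        excess = theta * theta          # R(theta) - inf R, since inf R = 0
        w = math.exp(-beta * excess)    # unnormalized Gibbs weight
        num += w * excess
        den += w
    return beta * num / den

values = {beta: beta_times_excess_risk(beta) for beta in (10.0, 50.0, 200.0)}
```

The grid step cancels in the ratio, so only the weights matter; the three values illustrate the convergence towards $1/2$ as $\beta$ grows.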

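In case of assumption (2.8), if we restrict ourselves to sufficiently large values of
the constants $\beta$, $\beta_1$, $\beta_2$, $\lambda_1$, $\lambda_2$ and $\gamma$ (the smallest of which is as a rule $\beta$, as we
will see), we can use the fact that for some (small) positive constant $\delta$, and some
(large) positive constant $A$,

$$(2.9) \qquad \frac{d}{\alpha} (1 - \delta) \le \pi_{\exp(-\alpha R)}\bigl[R'(\cdot, \tilde{\theta})\bigr] \le \frac{d}{\alpha} (1 + \delta), \qquad \alpha \ge A.$$

Under this assumption

$$\begin{aligned}
\pi_{\exp(-\beta_1 R)} \otimes \pi_{\exp(-\beta_2 R)}\bigl[\Psi_{-\frac{\gamma}{N}}(R', M')\bigr]
&\le \tfrac{N}{\gamma} \sinh(\tfrac{\gamma}{N}) \Bigl[ \tfrac{(1 + \delta) d}{\beta_1} - \tfrac{(1 - \delta) d}{\beta_2} \Bigr] \\
&\quad + \tfrac{2xN}{\gamma} \sinh(\tfrac{\gamma}{2N})^2 (1 + \delta) \Bigl[ \tfrac{d}{\beta_1} + \tfrac{d}{\beta_2} \Bigr] + \tfrac{4N}{\gamma} \sinh(\tfrac{\gamma}{2N})^2 \varphi(x), \\
C(\beta, \gamma) &\le (1 + \delta) d \log\biggl( \frac{\beta}{\beta - 2xN \sinh(\frac{\gamma}{2N})^2} \biggr) \\
&\quad + 2xN \sinh(\tfrac{\gamma}{2N})^2 \frac{(1 + \delta) d}{\beta} + 4N \sinh(\tfrac{\gamma}{2N})^2 \varphi(x), \\
\log\Bigl\{ \pi^{\otimes 2}_{\exp(-\beta R)}\bigl[\exp(N a \Phi_{-a} \circ M')\bigr] \Bigr\}
&\le \frac{2xN [\exp(a) - 1] \, d (1 + \delta)}{\beta - xN [\exp(a) - 1]} + 2N [\exp(a) - 1] \varphi(x).
\end{aligned}$$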
Thus with P probability at least 1 — ry, 

B(7r exp( _ Air) ,/3,7) < "(7 "Ai)f sinh(^)^(l-<S) 
+ (7 - Al ){f S inh(^)fi±^ 

+ *f- sinh(^) 2 (l + <5) [A + £] + M sinh^) 2 ^) 






+ 



4a;j V sinh (^)2 __Jl^_ +4jVsinh(^)VW +log(p 



+ 



^sinh(#)- 7 



+ — (4xiVsinh(^) 2 



7-^sinh^) 



7 I 



(1+j) d 



sinh^-^JVsinh^P 



+ 4iVsinh(^)V(x)+log( 



(?)} 



7 • 



)|2d(i + <y) 


' Aisinh(-^-) 




exp 


f logJcOBht^)] ^ 


-i 



+ 2N 



exp 



( ' og| ;:¥" )-fM 



(l-^)log(l)-log[^({/3})K{7})e] 



+ 



1- 

7 



^tanh(^)-H 7 



, sinh( 1 ^)-2 2: Ar s inh( I L)2 



+ 4iVsinh(^)V(x)+log(I)| 



+ 



2d(i + <y)| 



X7 



A 2 sinh(-j^) 



, log[co 3 h(-2.)] 



+ 2iV 



exp(^li)l)_i] v(a;) 



+ (l-4f)log(p-log[K/3M7) £ ]l. 



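Now let us choose for simplicity $\beta_2 = 2 \lambda_2 = 4 \beta$, $\beta_1 = \lambda_1 / 2 = \gamma / 4$, and let us
introduce the notation

$$\begin{aligned}
C_1 &= \frac{N}{\gamma} \sinh(\tfrac{\gamma}{N}), \\
C_2 &= \frac{N}{\gamma} \tanh(\tfrac{\gamma}{N}), \\
C_3 &= \frac{N^2}{\gamma^2} \Bigl[ \exp\bigl(\tfrac{\gamma^2}{N^2}\bigr) - 1 \Bigr], \\
\text{and } C_4 &= \frac{2 N^2 (1 - \frac{2\beta}{\gamma})}{\gamma^2} \biggl[ \exp\biggl( \frac{\gamma^2}{2 N^2 (1 - \frac{2\beta}{\gamma})} \biggr) - 1 \biggr],
\end{aligned}$$

to obtain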



C17 

B(7r oxp( _ Air) ,/3,7) < -- — (1 - <5)d 
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-1 



+ 



1 



2Ci - 1 



+ 



1 



2-Ci 



2(1 + ^(^-1)" +C 1 ^^)+log(I) 



+ 2»gl ± ^ +Ci ^ (sc) + 



N — x-f 



+ d(l + 6) 



+ [ 4C 2 - 2 



V2C 3 TV 



4/3 f 7 



+ ^C 3 ^) + ^-log[K/3)K7)e] 



iV 1 



- L N C 1 (l + S)d(20C 1 - X C l2Nj 

+ ^(x)+log(I)} 



4/3Ci 
7C4 



2£\_x2 

N 



N{1 1 _2_ l fMx) 



+ 



(l-f) log(I)-log[K/3M7) £ ] • 



This simplifies to 



S(7r exp( _ Air) ,/3, 7 )<-^(l-<5)^ 

+ 2C 1 (l + <5)d + log(|) 2 + 



3d 



+ 



7 



(4Ci-2)(2-Ci) 1 4C 2 _ 2 

- (l + lo 4K/?M7)e 

~r TV \ 2C ' 1 - 1 v 2C i N J 

+ 2(l- 7 f)" + te-7f Vl -^ M 



+ 



+ 



(1 + 5)dxY 



{ 



Ci 



+ 



16 2-Ci 



/ _8_ _ £7!.^ 1 
V Ci AT/3 j 

4. f 1 - 1 [ 4C! _ 2/3 \ 

^ 7 y 2c 2 -i L c 4 v t J 



7 (4C 2 -2) 



7 X 

7W 



+ t (p{x) ^iq L + _^_ + _Q b _ + C3 + 



4/3 + C 4 



7 (4C 2 -2) ^ 4C 2 -2 



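This shows that there exist universal positive real constants $A_1$, $A_2$, $B_1$, $B_2$, $B_3$,
and $B_4$ such that as soon as $\frac{\gamma \max\{x, 1\}}{N} \le A_1 \frac{\beta}{\gamma} \le A_2$,

$$B(\pi_{\exp(-\lambda_1 r)}, \beta, \gamma) \le -B_1 (1 - \delta) d \frac{\gamma}{\beta} + B_2 (1 + \delta) d - B_3 \log[\nu(\beta) \nu(\gamma) \epsilon \eta] + B_4 \frac{\gamma^2}{N} \varphi(x).$$

Thus $\pi_{\exp(-\lambda_1 r)}(R) \le \pi_{\exp(-\beta R)}(R) \le \inf_{\Theta} R + \frac{(1 + \delta) d}{\beta}$ as soon as

$$\frac{\beta}{\gamma} \le \frac{B_1}{B_2 \frac{(1 + \delta)}{(1 - \delta)} + \frac{B_4 \frac{\gamma^2}{N} \varphi(x) - B_3 \log[\nu(\beta) \nu(\gamma) \epsilon \eta]}{(1 - \delta) d}}.$$

Choosing some real ratio $\alpha > 1$, we can now make the above result uniform for
any

$$(2.10) \qquad \beta, \gamma \in \Lambda_\alpha \overset{\text{def}}{=} \Bigl\{ \alpha^k ; k \in \mathbb{N}, 0 \le k < \frac{\log(N)}{\log(\alpha)} \Bigr\},$$

by substituting $\nu(\beta)$ and $\nu(\gamma)$ with $\frac{\log(\alpha)}{\log(\alpha N)}$ and $-\log(\eta)$ with $-\log(\eta) + 2 \log\Bigl[ \frac{\log(\alpha N)}{\log(\alpha)} \Bigr]$.

Taking $\eta = \epsilon$ for simplicity, we can summarize our result in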

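Theorem 2.1.15. There exist positive real universal constants $A$, $B_1$, $B_2$, $B_3$
and $B_4$ such that for any positive real constants $\alpha > 1$, $d$ and $\delta$, for any prior
distribution $\pi \in \mathcal{M}_+^1(\Theta)$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any $\beta, \gamma \in \Lambda_\alpha$ (where
$\Lambda_\alpha$ is defined by equation (2.10) above) such that

$$\sup_{\beta' \in \mathbb{R}, \beta' \ge \beta} \Bigl| \frac{\beta'}{d} \bigl[ \pi_{\exp(-\beta' R)}(R) - \inf_{\Theta} R \bigr] - 1 \Bigr| \le \delta$$

and such that also for some positive real parameter $x$

$$\frac{\gamma \max\{x, 1\}}{N} \le \frac{A \beta}{\gamma} \quad \text{and} \quad \frac{\beta}{\gamma} \le \frac{B_1}{B_2 \frac{(1 + \delta)}{(1 - \delta)} + \frac{B_4 \frac{\gamma^2}{N} \varphi(x) - 2 B_3 \log(\epsilon) + 4 B_3 \log\bigl( \frac{\log(\alpha N)}{\log(\alpha)} \bigr)}{(1 - \delta) d}},$$

the bound $B(\pi_{\exp(-\frac{\gamma}{2} r)}, \beta, \gamma)$ given by Theorem 2.1.3 on page 54 in the case where we
have chosen $\nu$ to be the uniform probability measure on $\Lambda_\alpha$, satisfies
$B(\pi_{\exp(-\frac{\gamma}{2} r)}, \beta, \gamma) \le 0$, proving that $\hat{\beta}(\pi_{\exp(-\frac{\gamma}{2} r)}) \ge \beta$ and therefore that

$$\pi_{\exp(-\frac{\gamma}{2} r)}(R) \le \pi_{\exp(-\beta R)}(R) \le \inf_{\Theta} R + \frac{(1 + \delta) d}{\beta}.$$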

What is important in this result is that we do not only bound $\pi_{\exp(-\frac{\gamma}{2} r)}(R)$,
but also $B(\pi_{\exp(-\frac{\gamma}{2} r)}, \beta, \gamma)$, and that we do it uniformly on a grid of values of $\beta$
and $\gamma$, showing that we can indeed set the constants $\beta$ and $\gamma$ adaptively using the
empirical bound $B(\pi_{\exp(-\frac{\gamma}{2} r)}, \beta, \gamma)$.
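
The adaptive choice of $\beta$ and $\gamma$ on a geometric grid can be sketched as follows. This is only an illustration under stated assumptions: `bound` is a hypothetical user-supplied callable standing for $(\beta, \gamma) \mapsto B(\pi_{\exp(-\frac{\gamma}{2} r)}, \beta, \gamma)$, which the sketch does not compute, and the grid is taken to be $\{\alpha^k ; \alpha^k < N\}$. The rule returns the $\gamma$ for which the largest admissible $\beta$ (one with a non-positive bound) is greatest.

```python
def geometric_grid(alpha, n):
    """Grid {alpha**k : k = 0, 1, ... and alpha**k < n} (assumed form of the grid)."""
    values = []
    v = 1.0
    while v < n:
        values.append(v)
        v *= alpha
    return values

def select_parameters(bound, alpha, n):
    """Return (gamma_hat, beta_hat) maximizing the largest beta with bound <= 0.

    `bound(beta, gamma)` is a hypothetical callable returning the empirical
    bound; None is returned when no pair on the grid is admissible.
    """
    grid = geometric_grid(alpha, n)
    best = None
    for gamma in grid:
        admissible = [beta for beta in grid if bound(beta, gamma) <= 0]
        if admissible and (best is None or max(admissible) > best[1]):
            best = (gamma, max(admissible))
    return best

# Toy stand-in for the bound: non-positive exactly when beta <= gamma / 4.
toy_bound = lambda beta, gamma: beta - gamma / 4.0
choice = select_parameters(toy_bound, alpha=2.0, n=64)
```

Ties between values of $\gamma$ are broken here in favour of the smallest one, which is an arbitrary convention of the sketch.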

Let us see what we get under the margin assumption (1.24, page 39). When
$\kappa = 1$, we have $\varphi(c^{-1}) \le 0$, leading to

Corollary 2.1.16. Assuming that the margin assumption (1.24, page 39) is
satisfied for $\kappa = 1$, that $R : \Theta \to (0, 1)$ is independent of $N$ (which is the case for
instance when $\mathbb{P} = P^{\otimes N}$), and is such that

$$\lim_{\beta' \to +\infty} \beta' \bigl[ \pi_{\exp(-\beta' R)}(R) - \inf_{\Theta} R \bigr] = d,$$

there are universal positive real constants $B_5$ and $B_6$ and $N_1 \in \mathbb{N}$ such that for any
$N \ge N_1$, with $\mathbb{P}$ probability at least $1 - \epsilon$

$$\pi_{\exp(-\frac{\hat{\gamma}}{2} r)}(R) \le \inf_{\Theta} R + \frac{B_5 d}{c N} \biggl[ 1 + \frac{B_6}{d} \log\Bigl( \frac{\log(N)}{\epsilon} \Bigr) \biggr]^2,$$

where $\hat{\gamma} \in \arg\max_{\gamma \in \Lambda_2} \max\{\beta \in \Lambda_2 ; B(\pi_{\exp(-\frac{\gamma}{2} r)}, \beta, \gamma) \le 0\}$, where $\Lambda_2$ is defined
by equation (2.10, page 66), and $B$ is the bound of Theorem 2.1.3 (page 54).

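When $\kappa > 1$, $\varphi(x) \le (1 - \kappa^{-1}) (\kappa c x)^{-\frac{1}{\kappa - 1}}$, and we can choose $\gamma$ and $x$ such that
$\frac{\gamma^2}{N} \varphi(x) \simeq d$ to prove

Corollary 2.1.17. Assuming that the margin assumption (1.24, page 39) is
satisfied for some exponent $\kappa > 1$, that $R : \Theta \to (0, 1)$ is independent of $N$ (which is
for instance the case when $\mathbb{P} = P^{\otimes N}$), and is such that

$$\lim_{\beta' \to +\infty} \beta' \bigl[ \pi_{\exp(-\beta' R)}(R) - \inf_{\Theta} R \bigr] = d,$$

there are universal positive constants $B_7$ and $B_8$ and $N_1 \in \mathbb{N}$ such that for any
$N \ge N_1$, with $\mathbb{P}$ probability at least $1 - \epsilon$,

$$\pi_{\exp(-\frac{\hat{\gamma}}{2} r)}(R) \le \inf_{\Theta} R + B_7 c^{-\frac{1}{2\kappa - 1}} \biggl[ 1 + \frac{B_8}{d} \log\Bigl( \frac{\log(N)}{\epsilon} \Bigr) \biggr]^{\frac{2\kappa}{2\kappa - 1}} \Bigl( \frac{d}{N} \Bigr)^{\frac{\kappa}{2\kappa - 1}},$$

where $\hat{\gamma} \in \arg\max_{\gamma \in \Lambda_2} \max\{\beta \in \Lambda_2 ; B(\pi_{\exp(-\frac{\gamma}{2} r)}, \beta, \gamma) \le 0\}$, $\Lambda_2$ being defined by
equation (2.10, page 66) and $B$ by Theorem 2.1.3 (page 54).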

We find the same rate of convergence as in Corollary 1.4.7, but this
time, we were able to provide an empirical posterior distribution $\pi_{\exp(-\frac{\hat{\gamma}}{2} r)}$ which
achieves this rate adaptively in all the parameters (meaning in particular that we do
not need to know $d$, $c$ or $\kappa$). Moreover, as already mentioned, the power of $N$ in this
rate of convergence is known to be optimal in the worst case (see Mammen et al.
(1999); Tsybakov (2004); Tsybakov et al. (2005), and more specifically Audibert
(2004b), downloadable from its author's web page, Theorem 3.3, page 132).



2.1.5. Estimating the divergence of a posterior with respect to a Gibbs 
prior 

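Another interesting question is to estimate $\mathcal{K}[\rho, \pi_{\exp(-\beta R)}]$ using relative deviation
inequalities. We follow here an idea to be found first in (Audibert, 2004b, page 93).
Indeed, combining equation (2.3, page 52) with equation (2.1, page 51), we see that
for any positive real parameters $\beta$ and $\gamma$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any
posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\begin{aligned}
\mathcal{K}[\rho, \pi_{\exp(-\beta R)}] &\le \frac{\beta}{N \tanh(\frac{\gamma}{N})} \Bigl\{ \gamma \bigl[\rho(r) - \pi_{\exp(-\beta R)}(r)\bigr] + N \log[\cosh(\tfrac{\gamma}{N})] \rho \otimes \pi_{\exp(-\beta R)}(m') \\
&\qquad\qquad + \mathcal{K}[\rho, \pi_{\exp(-\beta R)}] - \log(\epsilon) \Bigr\} + \mathcal{K}(\rho, \pi) - \mathcal{K}[\pi_{\exp(-\beta R)}, \pi] \\
&\le \mathcal{K}\bigl[\rho, \pi_{\exp[-\beta \frac{\gamma}{N} \tanh(\frac{\gamma}{N})^{-1} r]}\bigr] + \frac{\beta}{N \tanh(\frac{\gamma}{N})} \Bigl\{ \mathcal{K}[\rho, \pi_{\exp(-\beta R)}] - \log(\epsilon) \Bigr\} \\
&\quad + \log\Bigl\{ \pi_{\exp[-\beta \frac{\gamma}{N} \tanh(\frac{\gamma}{N})^{-1} r]}\Bigl[ \exp\bigl\{ \beta \tanh(\tfrac{\gamma}{N})^{-1} \log[\cosh(\tfrac{\gamma}{N})] \rho(m') \bigr\} \Bigr] \Bigr\}.
\end{aligned}$$

We thus obtain

Theorem 2.1.18. For any positive real constants $\beta$ and $\gamma$ such that $\beta < N \tanh(\frac{\gamma}{N})$,
with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\begin{aligned}
\mathcal{K}[\rho, \pi_{\exp(-\beta R)}] \le \Bigl( 1 - \frac{\beta}{N \tanh(\frac{\gamma}{N})} \Bigr)^{-1} \biggl\{ & \mathcal{K}\bigl[\rho, \pi_{\exp[-\beta \frac{\gamma}{N} \tanh(\frac{\gamma}{N})^{-1} r]}\bigr] - \frac{\beta}{N \tanh(\frac{\gamma}{N})} \log(\epsilon) \\
& + \log\Bigl\{ \pi_{\exp[-\beta \frac{\gamma}{N} \tanh(\frac{\gamma}{N})^{-1} r]}\Bigl[ \exp\bigl\{ \beta \tanh(\tfrac{\gamma}{N})^{-1} \log[\cosh(\tfrac{\gamma}{N})] \rho(m') \bigr\} \Bigr] \Bigr\} \biggr\}.
\end{aligned}$$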



This theorem provides another way of measuring over-fitting, since it gives an
upper bound for $\mathcal{K}\bigl[\pi_{\exp[-\beta \frac{\gamma}{N} \tanh(\frac{\gamma}{N})^{-1} r]}, \pi_{\exp(-\beta R)}\bigr]$. It may be used in combination
with Theorem 1.2.6 (page 11) as an alternative to Theorem 1.3.7 (page 21). It will
also be used in the next section.

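An alternative parametrization of the same result providing a simpler right-hand
side is also useful:

Corollary 2.1.19. For any positive real constants $\beta$ and $\gamma$ such that $\beta < \gamma$, with
$\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\mathcal{K}\bigl[\rho, \pi_{\exp[-N \frac{\beta}{\gamma} \tanh(\frac{\gamma}{N}) R]}\bigr] \le \Bigl( 1 - \frac{\beta}{\gamma} \Bigr)^{-1} \biggl\{ \mathcal{K}[\rho, \pi_{\exp(-\beta r)}] - \frac{\beta}{\gamma} \log(\epsilon) + \log\Bigl\{ \pi_{\exp(-\beta r)}\Bigl[ \exp\bigl\{ N \tfrac{\beta}{\gamma} \log[\cosh(\tfrac{\gamma}{N})] \rho(m') \bigr\} \Bigr] \Bigr\} \biggr\}.$$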



2.2. Playing with two posterior and two local prior distributions 
2.2.1. Comparing two posterior distributions 

Estimating the effective temperature of an estimator provides an efficient way to
tune parameters in a model with parametric behaviour. On the other hand, it is
not suited to choosing between different models, especially when they are nested,
because as we already saw in the case when $\Theta$ is a union of nested models, the prior
distribution $\pi_{\exp(-\beta R)}$ does not provide an efficient localization of the parameter in
this case, in the sense that $\pi_{\exp(-\beta R)}(R)$ does not go down to $\inf_{\Theta} R$ at the desired
rate when $\beta$ goes to $+\infty$, requiring a resort to partial localization.

Once some estimator (in the form of a posterior distribution) has been chosen
in each sub-model, these estimators can be compared between themselves with the
help of the relative bounds that we will establish in this section. It is also possible
to choose several estimators in each sub-model, to tune parameters at the same
time (like the inverse temperature parameter if we decide to use Gibbs posterior
distributions in each sub-model).

From equation (2.2, page 52) (slightly modified by replacing $\pi \otimes \pi$ with $\pi^1 \otimes \pi^2$),
we easily obtain



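Theorem 2.2.1. For any positive real constant $\lambda$, for any prior distributions
$\pi^1, \pi^2 \in \mathcal{M}_+^1(\Theta)$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distributions $\rho_1$
and $\rho_2 : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$-N \log\bigl\{ 1 - \tanh(\tfrac{\lambda}{N}) \bigl[\rho_2(R) - \rho_1(R)\bigr] \bigr\} \le \lambda \bigl[\rho_2(r) - \rho_1(r)\bigr] + N \log[\cosh(\tfrac{\lambda}{N})] \rho_1 \otimes \rho_2(m') + \mathcal{K}(\rho_1, \pi^1) + \mathcal{K}(\rho_2, \pi^2) - \log(\epsilon).$$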

This is where the entropy bound of the previous section enters into the game,
providing a localized version of Theorem 2.2.1 (page 69). We will use the notation

$$(2.11) \qquad \Xi_a(q) = \tanh(a)^{-1} \bigl[ 1 - \exp(-a q) \bigr] \le \frac{a}{\tanh(a)} q, \qquad a \in \mathbb{R}_+, \; q \in \mathbb{R}.$$
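
To make the role of this notation concrete, here is a small sketch (an illustration, not the author's code) of how the function of equation (2.11) inverts the inequality of Theorem 2.2.1: writing the right-hand side of that theorem as $\lambda q$, the inequality is equivalent to $\rho_2(R) - \rho_1(R) \le \Xi_{\lambda / N}(q)$. The function and argument names (`emp_diff` for $\rho_2(r) - \rho_1(r)$, `emp_var` for $\rho_1 \otimes \rho_2(m')$, `kl1`, `kl2` for the two divergences) and the numerical values are assumptions of this example.

```python
import math

def xi(a, q):
    """The function of equation (2.11): tanh(a)^{-1} * (1 - exp(-a q)), for a > 0."""
    return (1.0 - math.exp(-a * q)) / math.tanh(a)

def relative_bound(lam, n, emp_diff, emp_var, kl1, kl2, eps):
    """Upper bound on rho_2(R) - rho_1(R) obtained by inverting Theorem 2.2.1."""
    rhs = (lam * emp_diff
           + n * math.log(math.cosh(lam / n)) * emp_var
           + kl1 + kl2 - math.log(eps))
    return xi(lam / n, rhs / lam)

# Hypothetical empirical quantities, for illustration only.
bound = relative_bound(lam=50.0, n=1000, emp_diff=0.02, emp_var=0.1,
                       kl1=2.0, kl2=3.0, eps=0.05)
```

Since $1 - \exp(-aq) \le aq$, the value returned is never larger than the linearized bound $\frac{a}{\tanh(a)} q$ with $a = \lambda / N$.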



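Theorem 2.2.2. For any $\epsilon \in )0, 1($, any sequence of prior distributions $(\pi^i)_{i \in \mathbb{N}} \in \mathcal{M}_+^1(\Theta)^{\mathbb{N}}$,
any probability distribution $\mu$ on $\mathbb{N}$, any atomic probability distribution $\nu$
on $\mathbb{R}_+$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distributions $\rho_1, \rho_2 : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\rho_2(R) - \rho_1(R) \le B(\rho_1, \rho_2), \text{ where}$$

$$\begin{aligned}
B(\rho_1, \rho_2) = \inf_{\substack{\lambda, \beta_1 < \gamma_1, \beta_2 < \gamma_2 \in \mathbb{R}_+, \\ i, j \in \mathbb{N}}} \Xi_{\frac{\lambda}{N}} \Biggl\{ & \rho_2(r) - \rho_1(r) + \frac{N}{\lambda} \log[\cosh(\tfrac{\lambda}{N})] \rho_1 \otimes \rho_2(m') \\
& + \frac{1}{\lambda (1 - \frac{\beta_1}{\gamma_1})} \biggl\{ \mathcal{K}[\rho_1, \pi^i_{\exp(-\beta_1 r)}] + \log\Bigl\{ \pi^i_{\exp(-\beta_1 r)}\Bigl[ \exp\bigl\{ \beta_1 \tfrac{N}{\gamma_1} \log[\cosh(\tfrac{\gamma_1}{N})] \rho_1(m') \bigr\} \Bigr] \Bigr\} - \frac{\beta_1}{\gamma_1} \log[\nu(\gamma_1)] \biggr\} \\
& + \frac{1}{\lambda (1 - \frac{\beta_2}{\gamma_2})} \biggl\{ \mathcal{K}[\rho_2, \pi^j_{\exp(-\beta_2 r)}] + \log\Bigl\{ \pi^j_{\exp(-\beta_2 r)}\Bigl[ \exp\bigl\{ \beta_2 \tfrac{N}{\gamma_2} \log[\cosh(\tfrac{\gamma_2}{N})] \rho_2(m') \bigr\} \Bigr] \Bigr\} - \frac{\beta_2}{\gamma_2} \log[\nu(\gamma_2)] \biggr\} \\
& - \Bigl[ \bigl(\tfrac{\gamma_1}{\beta_1} - 1\bigr)^{-1} + \bigl(\tfrac{\gamma_2}{\beta_2} - 1\bigr)^{-1} + 1 \Bigr] \frac{\log\bigl[ 3^{-1} \nu(\beta_1) \nu(\beta_2) \nu(\lambda) \mu(i) \mu(j) \epsilon \bigr]}{\lambda} \Biggr\}.
\end{aligned}$$

The sequence of prior distributions $(\pi^i)_{i \in \mathbb{N}}$ should be understood to be typically
supported by subsets of $\Theta$ corresponding to parametric sub-models, that is
sub-models for which it is reasonable to expect that

$$\lim_{\beta \to +\infty} \beta \bigl[ \pi^i_{\exp(-\beta R)}(R) - \operatorname{ess\,inf}_{\pi^i} R \bigr]$$

exists and is positive and finite. As there is no reason why the bound $B(\rho_1, \rho_2)$
provided by the previous theorem should be sub-additive (in the sense that $B(\rho_1, \rho_3) \le
B(\rho_1, \rho_2) + B(\rho_2, \rho_3)$), it is adequate to consider some workable subset $\mathcal{P}$ of posterior
distributions (for instance the distributions of the form $\pi^i_{\exp(-\beta r)}$, $i \in \mathbb{N}$, $\beta \in \mathbb{R}_+$),
and to define the sub-additive chained bound

$$(2.12) \qquad \widetilde{B}(\rho, \rho') = \inf\Bigl\{ \sum_{k=0}^{n-1} B(\rho_k, \rho_{k+1}) ; n \in \mathbb{N}^*, (\rho_k)_{k=0}^{n} \in \mathcal{P}^{n+1}, \rho_0 = \rho, \rho_n = \rho' \Bigr\}, \qquad \rho, \rho' \in \mathcal{P}.$$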
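
When the working set of posterior distributions is finite, the chained bound of equation (2.12) is a shortest-path computation over the matrix of pairwise bounds. The sketch below uses the Floyd-Warshall recursion, which is a choice made for this illustration and is not prescribed by the text. It assumes that the matrix `B` of pairwise bounds is given (here with made-up values), that the diagonal is set to zero by convention, and that no chain returning to its starting point has a negative total, which is the case on the event where all the bounds hold, since then $B(\rho_1, \rho_2) + B(\rho_2, \rho_1) \ge 0$.

```python
def chained_bound(B):
    """Chained bound over a finite family: chained[i][j] is the smallest sum of
    B along chains from i to j (Floyd-Warshall recursion).

    Assumes no negative cycles; the diagonal is set to 0 by convention.
    """
    m = len(B)
    chained = [row[:] for row in B]
    for i in range(m):
        chained[i][i] = 0.0
    for k in range(m):
        for i in range(m):
            for j in range(m):
                via_k = chained[i][k] + chained[k][j]
                if via_k < chained[i][j]:
                    chained[i][j] = via_k
    return chained

# Made-up pairwise bounds B[i][j] for three posterior distributions.
B = [[0.0, -0.3, 0.5],
     [0.6, 0.0, 0.4],
     [0.2, 0.3, 0.0]]
Bt = chained_bound(B)
```

In this toy example the direct bound from index 0 to index 2 is 0.5, whereas chaining through index 1 gives about 0.1; the resulting matrix is sub-additive by construction.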



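Proposition 2.2.3. With $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distributions
$\rho_1, \rho_2 \in \mathcal{P}$, $\rho_2(R) - \rho_1(R) \le \widetilde{B}(\rho_1, \rho_2)$. Moreover for any posterior distribution
$\rho_1 \in \mathcal{P}$, any posterior distribution $\rho_2 \in \mathcal{P}$ such that $\widetilde{B}(\rho_1, \rho_2) = \inf_{\rho_3 \in \mathcal{P}} \widetilde{B}(\rho_1, \rho_3)$ is
unimprovable with the help of $\widetilde{B}$ in $\mathcal{P}$ in the sense that $\inf_{\rho_3 \in \mathcal{P}} \widetilde{B}(\rho_2, \rho_3) \ge 0$.

Proof. The first assertion is a direct consequence of the previous theorem, so only
the second assertion requires a proof: for any $\rho_3 \in \mathcal{P}$, we deduce from the optimality
of $\rho_2$ and the sub-additivity of $\widetilde{B}$ that

$$\widetilde{B}(\rho_1, \rho_2) \le \widetilde{B}(\rho_1, \rho_3) \le \widetilde{B}(\rho_1, \rho_2) + \widetilde{B}(\rho_2, \rho_3).$$

□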

This proposition provides a way to improve a posterior distribution $\rho_1 \in \mathcal{P}$ by
choosing $\rho_2 \in \arg\min_{\rho \in \mathcal{P}} \widetilde{B}(\rho_1, \rho)$ whenever $\widetilde{B}(\rho_1, \rho_2) < 0$. This improvement is
proved by Proposition 2.2.3 to be one-step: the obtained improved posterior $\rho_2$
cannot be improved again using the same technique.

Let us give some examples of possible starting distributions $\rho_1$ for this improvement
scheme: $\rho_1$ may be chosen as the best posterior Gibbs distribution according
to Proposition 2.1.5 (page 56). More precisely, we may build from the prior
distributions $\pi^i$, $i \in \mathbb{N}$, a global prior $\pi = \sum_{i \in \mathbb{N}} \mu(i) \pi^i$. We can then define the estimator
of the inverse effective temperature as in Proposition 2.1.5 (page 56) and choose
$\rho_1 \in \arg\max_{\rho \in \mathcal{P}} \hat{\beta}(\rho)$, where $\mathcal{P}$ is as suggested above the set of posterior
distributions

$$\mathcal{P} = \bigl\{ \pi^i_{\exp(-\beta r)} ; i \in \mathbb{N}, \beta \in \mathbb{R}_+ \bigr\}.$$

This starting point $\rho_1$ should already be pretty good, at least in an asymptotic
perspective, the only gain in the rate of convergence to be expected bearing on
spurious $\log(N)$ factors.

2.2.2. Elaborate uses of relative bounds between posteriors 



More elaborate uses of relative bounds are described in the third section of the
second chapter of Audibert (2004b), where an algorithm is proposed and analysed,
which allows one to use relative bounds between two posterior distributions as a
stand-alone estimation tool.

Let us give here some alternative way to address this issue. We will assume for
simplicity and without great loss of generality that the working set of posterior
distributions $\mathcal{P}$ is finite (so that among other things any ordering of it has a first
element).

It is natural to define the estimated complexity of any given posterior distribution
$\rho \in \mathcal{P}$ in our working set as the bound for $\inf_{i \in \mathbb{N}} \mathcal{K}(\rho, \pi^i)$ used in Theorem 2.2.1
(page 69). This leads to set (given some confidence level $1 - \epsilon$)

$$\begin{aligned}
\mathcal{C}(\rho) = \inf_{\beta < \gamma \in \mathbb{R}_+, i \in \mathbb{N}} \Bigl( 1 - \frac{\beta}{\gamma} \Bigr)^{-1} \biggl\{ & \mathcal{K}[\rho, \pi^i_{\exp(-\beta r)}] + \log\Bigl\{ \pi^i_{\exp(-\beta r)}\Bigl[ \exp\bigl\{ \beta \tfrac{N}{\gamma} \log[\cosh(\tfrac{\gamma}{N})] \rho(m') \bigr\} \Bigr] \Bigr\} \\
& - \frac{\beta}{\gamma} \log\bigl[ 3^{-1} \nu(\gamma) \nu(\beta) \mu(i) \epsilon \bigr] \biggr\}.
\end{aligned}$$

Let us moreover call $\gamma(\rho)$, $\beta(\rho)$ and $i(\rho)$ the values achieving this infimum, or
nearly achieving it, which requires a slight change of the definition of $\mathcal{C}(\rho)$ to take
this modification into account. For the sake of simplicity, we can assume without
substantial loss of generality that the supports of $\nu$ and $\mu$ are large but finite, and
thus that the minimum is reached.

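To understand how this notion of complexity comes into play, it may be
interesting to keep in mind that for any posterior distributions $\rho$ and $\rho'$ we can write
the bound in Theorem 2.2.2 as

$$(2.13) \qquad B(\rho, \rho') = \inf_{\lambda \in \mathbb{R}_+} \Xi_{\frac{\lambda}{N}} \bigl[ \rho'(r) - \rho(r) + S_\lambda(\rho, \rho') \bigr],$$

where

$$\begin{aligned}
S_\lambda(\rho, \rho') = S_\lambda(\rho', \rho) \le{} & \frac{N}{\lambda} \log[\cosh(\tfrac{\lambda}{N})] \rho \otimes \rho'(m') + \frac{\mathcal{C}(\rho) + \mathcal{C}(\rho')}{\lambda} - \frac{\log(3^{-1} \epsilon)}{\lambda} \\
& - \frac{\log\bigl\{ \nu[\beta(\rho)] \mu[i(\rho)] \bigr\}}{\lambda \bigl( 1 - \frac{\beta(\rho')}{\gamma(\rho')} \bigr)} - \frac{\log\bigl\{ \nu[\beta(\rho')] \mu[i(\rho')] \bigr\}}{\lambda \bigl( 1 - \frac{\beta(\rho)}{\gamma(\rho)} \bigr)} \\
& - \Bigl[ \bigl( \tfrac{\gamma(\rho)}{\beta(\rho)} - 1 \bigr)^{-1} + \bigl( \tfrac{\gamma(\rho')}{\beta(\rho')} - 1 \bigr)^{-1} + 1 \Bigr] \frac{\log[\nu(\lambda)]}{\lambda}.
\end{aligned}$$

(Let us recall that the function $\Xi$ is defined by equation (2.11, page 69).) Thus for
any $\rho$, $\rho'$ such that $B(\rho', \rho) > 0$, we can deduce from the monotonicity of $\Xi_{\frac{\lambda}{N}}$ that

$$\rho'(r) - \rho(r) \le \inf_{\lambda \in \mathbb{R}_+} S_\lambda(\rho, \rho'),$$

proving that the left-hand side is small, and consequently that $B(\rho, \rho')$ and its
chained counterpart defined by equation (2.12, page 70) are small:

$$\widetilde{B}(\rho, \rho') \le B(\rho, \rho') \le \inf_{\lambda \in \mathbb{R}_+} \Xi_{\frac{\lambda}{N}} \bigl[ 2 S_\lambda(\rho, \rho') \bigr].$$

It is also worth noticing that $B(\rho, \rho')$ and $\widetilde{B}(\rho, \rho')$ are upper bounded in terms of
variance and complexity only.

The presence of the ratios $\frac{\gamma(\rho)}{\beta(\rho)}$ should not be obnoxious, since their values should
be automatically tamed by the fact that $\beta(\rho)$ and $\gamma(\rho)$ should make the estimate
of the complexity of $\rho$ optimal.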

As an alternative, it is possible to restrict to the set of parameter values $\beta$ and $\gamma$
such that, for some fixed constant $\zeta > 1$, the ratio $\frac{\gamma}{\beta}$ is bounded away from 1 by
the inequality $\frac{\gamma}{\beta} \ge \zeta$. This leads to an alternative definition of $\mathcal{C}(\rho)$:



e(p)= inf (1--) locUw, 

+ log{< xp( ^ r) [exp{/3f log[cosh(#)]p(m')}] } 



T exp(-/3r)J 



Pi r „_i , , m , n \ Iog[i/G9)/i(i)] log^" 1 * 
-log[3 L v(j)v((3)p(i)e l 1 



7 ° L v " — w jj (1-C 1 ) 2 ' 

We can even push simplification a step further, postponing the optimization of the
ratio $\frac{\gamma}{\beta}$, and setting it to the fixed value $\zeta$. This leads us to adopt the definition



(2.i4) e( P ) 



inf (1-C 1 ) l \x[p, < X p(-/3r)l 



+ 



log{< xp( -^) [ex P { f log[cosh(f )] p(m') }] } 

- ^±l{log[K/%«] + 2- 1 log^e)}- 

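With either of these modified definitions of the complexity $\mathcal{C}(\rho)$, we get the upper
bound

$$(2.15) \qquad S_\lambda(\rho, \rho') \le \overline{S}_\lambda(\rho, \rho') \overset{\text{def}}{=} \frac{N}{\lambda} \log[\cosh(\tfrac{\lambda}{N})] \rho \otimes \rho'(m') + \frac{1}{\lambda} \Bigl\{ \mathcal{C}(\rho) + \mathcal{C}(\rho') - \frac{\zeta + 1}{\zeta - 1} \log[\nu(\lambda)] \Bigr\}.$$

With these definitions, we have for any posterior distributions $\rho$ and $\rho'$

$$B(\rho, \rho') \le \inf_{\lambda \in \mathbb{R}_+} \Xi_{\frac{\lambda}{N}} \bigl\{ \rho'(r) - \rho(r) + \overline{S}_\lambda(\rho, \rho') \bigr\}.$$

Consequently in the case when $B(\rho', \rho) > 0$, we get

$$\widetilde{B}(\rho, \rho') \le B(\rho, \rho') \le \inf_{\lambda \in \mathbb{R}_+} \Xi_{\frac{\lambda}{N}} \bigl[ 2 \overline{S}_\lambda(\rho, \rho') \bigr].$$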

To select some nearly optimal posterior distribution in $\mathcal{P}$, it is appropriate to
order the posterior distributions of $\mathcal{P}$ according to increasing values of their
complexity $\mathcal{C}(\rho)$ and consider some indexation $\mathcal{P} = \{\rho_1, \dots, \rho_M\}$, where $\mathcal{C}(\rho_k) \le \mathcal{C}(\rho_{k+1})$,
$1 \le k < M$.

Let us now consider for each $\rho_k \in \mathcal{P}$ the first posterior distribution in $\mathcal{P}$ which
cannot be proved to be worse than $\rho_k$ according to the bound $\widetilde{B}$:

$$(2.16) \qquad t(k) = \min\bigl\{ j \in \{1, \dots, M\} : \widetilde{B}(\rho_j, \rho_k) > 0 \bigr\}.$$

In this definition, which uses the chained bound defined by equation (2.12, page
70), it is appropriate to assume by convention that $\widetilde{B}(\rho, \rho) = 0$, for any posterior
distribution $\rho$. Let us now define our estimated best $\rho \in \mathcal{P}$ as $\rho_{\hat{k}}$, where

$$(2.17) \qquad \hat{k} = \min(\arg\max t).$$

Thus we take the posterior with smallest complexity which can be proved to be
better than the largest starting interval of $\mathcal{P}$ in terms of estimated relative classification
error.
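
The selection rule of equations (2.16) and (2.17) can be sketched as follows, under assumptions made for this illustration only: the posteriors are indexed from 0 in order of increasing complexity, `chained[j][k]` holds the chained bound from the $j$-th to the $k$-th posterior (with zero diagonal), and the value `M` is returned for $t(k)$ when no index $j$ has a positive bound, a convention the text does not spell out.

```python
def first_unbeaten(chained, k):
    """t(k): smallest index j with chained[j][k] > 0.

    Returns len(chained) when there is no such j (assumption of this sketch).
    """
    for j in range(len(chained)):
        if chained[j][k] > 0:
            return j
    return len(chained)

def select_posterior(chained):
    """Return (k_hat, t) where t[k] = t(k) and k_hat = min(argmax t)."""
    t = [first_unbeaten(chained, k) for k in range(len(chained))]
    k_hat = t.index(max(t))  # list.index returns the smallest maximizing index
    return k_hat, t

# Made-up chained bounds for three posteriors ordered by complexity.
chained = [[0.0, 0.2, -0.1],
           [0.3, 0.0, -0.2],
           [0.5, 0.4, 0.0]]
k_hat, t = select_posterior(chained)
```

In this toy example the third posterior has non-positive chained bounds from both others, so it is the one selected.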

The following theorem is a simple consequence of the chosen optimisation scheme.
It is valid for any arbitrary choice of the complexity function $\rho \mapsto \mathcal{C}(\rho)$.

Theorem 2.2.4. Let us put $\hat{t} = t(\hat{k})$, where $t$ is defined by equation (2.16) and $\hat{k}$
is defined by equation (2.17). With $\mathbb{P}$ probability at least $1 - \epsilon$,

'0, l<j<t, 

B(Pj,Pt) + B {f>tiPpi J e (argmaxt), 
. B {pj>P%), 3 e {k+ 1, . . . ,M) \ (argmaxt), 

where the chained bound B is defined from the bound of Theorem \2.2.2\ (page\U@$ 
by equation \2.12[ page |70[ ). In the mean time, for any j such that t < j < k, 
t(J) < t — maxi, because j £" (argmaxt). Thus 

P^(R) < Pt(j)(R) < Pj(R) + inf E^[2Sx( Pj ,p m )] 

while p t (j){r) < Pj{r) + inf S\(pj, p tU) ), 

where the function 5 is defined by equation \2.11\ page \ 69\) and S\ is defined by 
equation \2.1 C A page \71ty . For any j G (argmaxt), (including notably k), 

B(p t ,p j )>B(p T ,p j )>0, 
B(p j ,p t )>B(p j ,p ? )>0, 

so in this case 

P%{R) < Pj (R) + A ip R f + s a S\{pj, p$ + S\(f>p p^) + S x (pj , p~) , 
while ptfr) < pj(r) + inf Sx(pj,p^j, 

A t M-j- 

Pt(r) < p^r) + inf S x {p£,Pt), 
and p^R) < pj(R) + inf S x [2S x (pj,p$] ■ 

Finally in the case when j £ |fc + 1, . . . , M} \ (argmaxt), due to the fact that in 
particular j (argmaxt), 

B{p li , P3 )>B{ n ,p j ) >0. 

Thus in this last case 

n {R) < Pj (R) + M Sa [2S x ( Pj ,p^] , 



while pr{r) < pj(r) + inf S\{p^pt). 
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Thus for any j = 1,...,M, p^(R) — pj (R) is bounded from above by an empirical 
quantity involving only variance and entropy terms of posterior distributions $\rho_\ell$ such
that $\ell \le j$, and therefore such that $\mathcal{C}(\rho_\ell) \le \mathcal{C}(\rho_j)$. Moreover, these distributions $\rho_\ell$
are such that $\rho_\ell(r) - \rho_j(r)$ and $\rho_\ell(R) - \rho_j(R)$ have an empirical upper bound of
the same order as the bound stated for $\rho_{\hat{k}}(R) - \rho_j(R)$, namely the bound for
$\rho_\ell(r) - \rho_j(r)$ is in all circumstances not greater than $\Xi_{\frac{\lambda}{N}}^{-1}$ applied to the bound
stated for $\rho_{\hat{k}}(R) - \rho_j(R)$, whereas the bound for $\rho_\ell(R) - \rho_j(R)$ is always smaller
than two times the bound stated for $\rho_{\hat{k}}(R) - \rho_j(R)$. This shows that variance terms
are between posterior distributions whose empirical as well as expected error rates
cannot be much larger than those of $\rho_j$.

Let us remark that the estimation scheme described in this theorem is very
general: the same method can be used as soon as some confidence interval for the
relative expected risks

\[ -B(\rho_2, \rho_1) \le \rho_2(R) - \rho_1(R) \le B(\rho_1, \rho_2) \quad \text{with } \mathbb{P} \text{ probability at least } 1 - \epsilon, \]

is available. The definition of the complexity is arbitrary, and could in an abstract
context be chosen as

\[ \mathcal{C}(\rho_1) = \inf_{\rho_2 \ne \rho_1} B(\rho_1, \rho_2) + B(\rho_2, \rho_1). \]

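Proof. The case when $1 \le j < \widehat{t}$ is straightforward from the definitions: when
$j < \widehat{t}$, $\widetilde{B}(\rho_j, \rho_{\widehat{k}}) \le 0$ and therefore $\rho_{\widehat{k}}(R) \le \rho_j(R)$.

In the second case, that is when $\widehat{t} \le j < \widehat{k}$, $j$ cannot be in $\arg\max t$, because of
the special choice of $\widehat{k}$ in $\arg\max t$. Thus $t(j) < \widehat{t}$ and we deduce from the first case
that

\[ \rho_{\widehat{k}}(R) \le \rho_{t(j)}(R) \le \rho_j(R) + \widetilde{B}(\rho_j, \rho_{t(j)}). \]

Moreover, we see from the definition of $t$ that $\widetilde{B}(\rho_{t(j)}, \rho_j) > 0$, implying

\[ \rho_{t(j)}(r) \le \rho_j(r) + \inf_{\lambda} S_\lambda(\rho_j, \rho_{t(j)}), \]

and therefore that

\[ \rho_{\widehat{k}}(R) \le \rho_j(R) + \inf_{\lambda} \Xi_{\lambda/N}\bigl[2 S_\lambda(\rho_j, \rho_{t(j)})\bigr]. \]

In the third case $j$ belongs to $\arg\max t$. In this case, we are not sure that
$\widetilde{B}(\rho_{\widehat{k}}, \rho_j) > 0$, and it is appropriate to involve $\widehat{t}$, which is the index of the first
posterior distribution which cannot be improved by $\rho_{\widehat{k}}$, implying notably that
$\widetilde{B}(\rho_{\widehat{t}}, \rho_k) > 0$ for any $k \in \arg\max t$. On the other hand, $\rho_{\widehat{t}}$ cannot either improve
any posterior distribution $\rho_k$ with $k \in (\arg\max t)$, because this would imply for any
$\ell \le \widehat{t}$ that $\widetilde{B}(\rho_\ell, \rho_{\widehat{t}}) \le \widetilde{B}(\rho_\ell, \rho_k) + \widetilde{B}(\rho_k, \rho_{\widehat{t}}) \le 0$, and therefore that $t(\widehat{t}) \ge \widehat{t} + 1$, in
contradiction of the fact that $\widehat{t} = \max t$. Thus $\widetilde{B}(\rho_k, \rho_{\widehat{t}}) > 0$, and these two remarks
imply that

\[ \rho_{\widehat{t}}(r) \le \rho_j(r) + \inf_{\lambda} S_\lambda(\rho_j, \rho_{\widehat{t}}), \]

and consequently also that

\[ \rho_{\widehat{k}}(R) \le \rho_j(R) + B(\rho_j, \rho_{\widehat{k}})
\le \rho_j(R) + \inf_{\lambda} \Xi_{\lambda/N}\bigl[S_\lambda(\rho_j, \rho_{\widehat{t}}) + S_\lambda(\rho_{\widehat{t}}, \rho_{\widehat{k}}) + S_\lambda(\rho_j, \rho_{\widehat{k}})\bigr] \]

and that

\[ \rho_{\widehat{t}}(R) \le \rho_j(R) + \inf_{\lambda} \Xi_{\lambda/N}\bigl[2 S_\lambda(\rho_j, \rho_{\widehat{t}})\bigr] \le \rho_j(R) + 2 \inf_{\lambda} \Xi_{\lambda/N}\bigl[S_\lambda(\rho_j, \rho_{\widehat{t}})\bigr], \]

the last inequality being due to the fact that $\Xi_a$ is a concave function. Let us
notice that it may be the case that $\widehat{k} < \widehat{t}$, but that only the case when $j \ge \widehat{t}$ is to
be considered, since otherwise we already know that $\rho_{\widehat{k}}(R) \le \rho_j(R)$.

In the fourth case, $j$ is greater than $\widehat{k}$, and the complexity of $\rho_j$ is larger than the
complexity of $\rho_{\widehat{k}}$. Moreover, $j$ is not in $\arg\max t$, and thus $\widetilde{B}(\rho_{\widehat{k}}, \rho_j) > 0$, because
otherwise, the sub-additivity of $\widetilde{B}$ would imply that $\widetilde{B}(\rho_\ell, \rho_j) \le 0$ for any $\ell < \widehat{t}$ and
therefore that $t(j) \ge \widehat{t} = \max t$. Therefore

\[ \rho_{\widehat{k}}(r) \le \rho_j(r) + \inf_{\lambda} S_\lambda(\rho_j, \rho_{\widehat{k}}), \]

and

\[ \rho_{\widehat{k}}(R) \le \rho_j(R) + \widetilde{B}(\rho_j, \rho_{\widehat{k}}) \le \rho_j(R) + \inf_{\lambda} \Xi_{\lambda/N}\bigl[2 S_\lambda(\rho_j, \rho_{\widehat{k}})\bigr]. \]

□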
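
The selection rule analysed in Theorem 2.2.4 can be illustrated by a small sketch. Equations (2.16) and (2.17) are not restated in this section, so the definitions below are assumptions read off the proof: $t(k)$ is taken to be the first index $j$ such that the chained bound $\widetilde{B}(\rho_j, \rho_k)$ is positive (the first posterior that $\rho_k$ is not proven to improve), and $\widehat{k}$ the smallest index in $\arg\max t$. The function name, the input matrix `chained` and the 0-based indexing are hypothetical conveniences.

```python
from typing import List, Tuple


def select_posterior(chained: List[List[float]]) -> Tuple[int, int]:
    """Return (k_hat, t_hat) as 0-based indices.

    chained[i][j] is assumed to be an upper confidence bound on
    rho_j(R) - rho_i(R), with posteriors sorted by increasing complexity.
    """
    m = len(chained)

    def t(k: int) -> int:
        # Assumed reading of (2.16): first j that rho_k is not proven
        # to improve; m when rho_k is proven to improve every rho_j.
        for j in range(m):
            if chained[j][k] > 0:
                return j
        return m

    t_values = [t(k) for k in range(m)]
    t_hat = max(t_values)
    # Assumed reading of (2.17): smallest index achieving the maximum of t.
    k_hat = t_values.index(t_hat)
    return k_hat, t_hat
```

With this convention, every posterior of index below `t_hat` is proven to be improved by the selected one, which is the first case of the theorem.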



2.2.3. Analysis of relative bounds 



Let us start our investigation of the theoretical properties of the algorithm described
in Theorem 2.2.4 (page 73) by computing some non-random upper bounds for
$B(\rho, \rho')$, the bound of Theorem 2.2.2 (page 69), and $\mathcal{C}(\rho)$, the complexity factor
defined by equation (2.14, page 72), for any $\rho, \rho' \in \mathcal{P}$.
This analysis will be done in the case when

\[ \mathcal{P} = \bigl\{ \pi^i_{\exp(-\beta r)} : \nu(\beta) > 0, \ \mu(i) > 0 \bigr\}, \]

in which it will be possible to get some control on the randomness of any $\rho \in \mathcal{P}$,
in addition to controlling the other random expressions appearing in the definition
of $B(\rho, \rho')$, $\rho, \rho' \in \mathcal{P}$. We will also use a simpler choice of complexity function,
removing from equation (2.14, page 72) the optimization in $i$ and $\beta$ and using
instead the definition



(2.i8) e(< xp( _^ r) ) d ^ f (i - cT 1 logj^p^,,) 

exp{f log[cosh(f)]< xp( ^ r) (m')} 



C + i 



-\og[v{l3)u{i)\. 



With this definition, 
N



)< T log[cosh(A)] 7r , 



cxp(-pY) ® n exp(~P'r) 



exp(— pY) J 



exp(— /3'r) J 



+ |^'oi[3-'KA) £ ], 
where $S_\lambda$ is defined by equation (2.13, page 71), so that

\[ B\bigl[\pi^i_{\exp(-\beta r)}, \pi^j_{\exp(-\beta' r)}\bigr] = \inf_{\lambda \in \mathbb{R}_+} \Xi_{\lambda/N} \Bigl\{ \pi^j_{\exp(-\beta' r)}(r) - \pi^i_{\exp(-\beta r)}(r)
+ S_\lambda\bigl[\pi^i_{\exp(-\beta r)}, \pi^j_{\exp(-\beta' r)}\bigr] \Bigr\}. \]

Let us successively bound the various random factors entering into the definition
of $B\bigl[\pi^i_{\exp(-\beta r)}, \pi^j_{\exp(-\beta' r)}\bigr]$. The quantity $\pi^j_{\exp(-\beta' r)}(r) - \pi^i_{\exp(-\beta r)}(r)$ can be
bounded using a slight adaptation of Proposition 2.1.11.



Proposition 2.2.5. For any positive real constants A, A' and 7, with P probability 
at least 1 — 77, /or any positive real constants ft, f3' such that (3 < A-^ sinh(-^) _1 



id p > A'^sinh(^) 



7r oxp(-/3V)( r ) _ 7r cxp(-/3r)( r ) 

log(t) CV(A', 7 )+log(§) C'(A,7) + log(|) 



where 



C*(A,7) = f lof 



exp 



7 ^ smh (_L)_ 7 



7 -^sinh(^; 



f sinh(^)ii'(^^ 2 )}7r: xp( _ Aii) (^ 2 ) 



)(^i) 



< log < 



'exp(-Aii) 



'exp(-Aii) 

exp{2iVsinh(^) 2 < xp( _ Afl) (M')} 



As for 7r 



exp(-pY) w "exp(-/3'r) 



vi ( m ')> we can write with P probability at least 1 — 77, 



for any posterior distributions p and p' : fl —> Mi_(8), 



7P ® p'(m') < log 



T cxp(-Afl) ® ^cxpf-A'fl) 



1 {exp[ 7 $_,(Af')]} 
+ 3C[p J < xp( _ AJI) ] + 3C[p',7r 



exp(-A'fl)J 



logfa). 



We can then replace $\lambda$ with $\beta \frac{N}{\gamma} \sinh(\frac{\gamma}{N})$ and use Theorem 2.1.12 (page 50) to get

Proposition 2.2.6. For any positive real constants $\gamma$, $\lambda$, $\lambda'$, $\beta$ and $\beta'$, with $\mathbb{P}$
probability $1 - \eta$,



7P (g> p'{m') 
< log



^expl-^f sinh( ® sinh(^)ii] 



{exp[ 7 $_,(M')]} 
K[P.TLp(-/»r)] , sinh(A),A] -log(f) 



1-1 



A _ i 

/3 1 



^[p'^expl^V)] , sinh(^),A'] -log(§) 



1 



0' 



1 



iog(i: 



A' /3' 

The last random factor in $B(\rho, \rho')$ that we need to upper bound is



{< xp (-(ir) exp{/3f log[cosh(^)]7r^ p( _^ r) (m')} }■ 



log 



A slight adaptation of Proposition 2.1.13 shows that with $\mathbb{P}$ probability
at least $1 - \eta$,



lo g{<xp(-/3r) 
2/3 



exp{/5f logfcosM^)]^^^^')}]} 

^), 7 ]+(l- 
JVlog[cosh(£)] 



< ^C*[f sinh(^), 7 ] + (1 - f) log[(< xp[ _^ sinh( ^ )fl] 



exp 



2-1 

/9 



log[cosh(^.; 

T3i 



oM' 



+ (1 + f ) log(§), 



where as usual $\Phi$ is the function defined by equation (1.1). This leads us to
define for any $i, j \in \mathbb{N}$, any $\beta, \beta' \in \mathbb{R}_+$,



(2.19) e^^^^c 



f sinh(^),C/5_ 

/iVlog[cosh(f ) 



exp - 



c-i 



Iog[cQ S h(^)] 



oM' 



L +1 



j^logfi/CSjMW] +log(|) 



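Recall that the definition of $C^i(\lambda, \gamma)$ is to be found in Proposition 2.2.5 (page 76).
Let us remark that, since

\[ \exp\bigl[N a \Phi_{-a}(p)\bigr] = \exp\bigl\{ N \log\bigl( 1 + [\exp(a) - 1] p \bigr) \bigr\}
\le \exp\bigl\{ N [\exp(a) - 1] p \bigr\}, \qquad p \in (0, 1), \]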

we have 



a £ 



e(i,/3) < 



C-i 



W 7T 



exp[- 



exp{2/Vsinh(^) 2 < xp[ inh( ^ )i?] (M')} 
+ log|(



'W-f s inh($)fl] 



S2 



cxp{7v[exp{(C - I)" 1 log[cosh(f )] } - l]M'^ 
C + l 



r |21og[i/C9)/i(*)] +log(2) 
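
The elementary inequality $\exp[N a \Phi_{-a}(p)] \le \exp\{N[\exp(a) - 1]p\}$ used in the simplification above can be checked numerically. The sketch below assumes that equation (1.1) defines $\Phi_a(p) = -a^{-1} \log\{1 - [1 - \exp(-a)]p\}$, a form that agrees with the identity quoted before the bound; the function names are hypothetical. The comparison is made on the logarithmic scale to avoid overflow.

```python
import math


def phi(a: float, p: float) -> float:
    """Phi_a(p) = -log(1 - (1 - exp(-a)) p) / a (assumed form of equation (1.1))."""
    return -math.log(1.0 - (1.0 - math.exp(-a)) * p) / a


def check(n: int, a: float, p: float) -> bool:
    """Check N a Phi_{-a}(p) = N log(1 + (e^a - 1) p) <= N (e^a - 1) p."""
    lhs = n * a * phi(-a, p)
    mid = n * math.log(1.0 + (math.exp(a) - 1.0) * p)
    rhs = n * (math.exp(a) - 1.0) * p
    return math.isclose(lhs, mid, rel_tol=1e-9, abs_tol=1e-12) and mid <= rhs
```

The inequality is simply $\log(1 + u) \le u$ applied to $u = [\exp(a) - 1]p$, which is why it holds for every $a > 0$ and $p \in (0, 1)$.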



Let us put 



AT 



Sx [(*, /?), (j, /?')] = T log[cosh(A)] inf 7 "* 



log 



(< xp [-f s inh(<£)fl] ^ "exp[-f s inh(^)i,'_ 



){cxp[ 7 $_^(M')]} 
C*[f sinh(f),C/3] -log(f) 



+ ■ 



c-i 



+ 



C^[f sinh(^),C/3'] -log(f) 



c-i 



log(|) 



e(i, /?) + C(i, /3') - log[3-^(A)e] 



where 



Let us remark that 



S x [(i,(3),(j,P')] < inf 



2iV 7 



log 



(<xp[-f 



s inh(^)ii] ^^expl-f sinh(<^)fl], 



T)A]){ 

exp[iV[exp(£) -1]M']} 



+ 



A 



27V 7 (C-1) A(C-l) 



l0 Si ^Ixpl-f sinh(^)ii] 



exp{2iV sinh(^) 2 < xp[ „ f (M>) }] } 



+ A" 1 l0g<! ( 7T 



+ 



cxp{7v[exp{(C - I)" 1 log[cosh(f )] } - l]M'} | 



A 



27V 7 (C-1) A(C-l) 



exp[-f s inh(A§-)fl] 



ex P {2^sinh(||) 2 ^ phfsinh( ^ )fl] (M0 



+ A" 1 log 7T 



exp[-f sinh(^)fi] 



cxp{iv[cxp{(C - l)- 1 log[cosh(^)] } - l]M'} J 
^±i^-log[3-^( 7 )^(/3)K/3')M(i)M0>]



(C + i) 
(C-l)A 

Let us define accordingly 

dcf 



21og[2-y/?M/3>(*K?>] +log[3-V(A)6 



T cxp(-a'iQ ® <xp(-aK) Af')] 



inf Si( inf 

A w I a, 7, a', 7' 



log(f) | ^(a',y)-log(f) , C*(a, 7 )-log(§) 



JMl sinh( ^)_y 7 _M sinh( ^) 

+ 5 A [(i,/3),(j,/J')]|, 

where 

»? = K A M a M7M/ 3 M a 0K70K/ ? ')M i )MC?) f 7- 

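Proposition 2.2.7.

• With $\mathbb{P}$ probability at least $1 - \eta$, for any $\beta \in \mathbb{R}_+$ and $i \in \mathbb{N}$,

\[ \mathcal{C}\bigl(\pi^i_{\exp(-\beta r)}\bigr) \le \overline{\mathcal{C}}(i, \beta); \]

• With $\mathbb{P}$ probability at least $1 - 3\eta$, for any $\lambda, \beta, \beta' \in \mathbb{R}_+$, any $i, j \in \mathbb{N}$,

\[ S_\lambda\bigl(\pi^i_{\exp(-\beta r)}, \pi^j_{\exp(-\beta' r)}\bigr) \le \overline{S}_\lambda\bigl[(i, \beta), (j, \beta')\bigr]; \]

• With $\mathbb{P}$ probability at least $1 - 4\eta$, for any $i, j \in \mathbb{N}$, any $\beta, \beta' \in \mathbb{R}_+$,

\[ B\bigl(\pi^i_{\exp(-\beta r)}, \pi^j_{\exp(-\beta' r)}\bigr) \le \overline{B}\bigl[(i, \beta), (j, \beta')\bigr]. \]

It is also interesting to find a non-random lower bound for $\mathcal{C}\bigl(\pi^i_{\exp(-\beta r)}\bigr)$. Let us
start from the fact that with $\mathbb{P}$ probability at least $1 - \eta$,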

[$y(M')] 
< ^cxpf-afl) ® <xp(-afl)( m ) — ■ 

On the other hand, we already proved that with P probability at least 1 — 77, 

< 



7Vtanh(-^) 



a [pW -<xp( Qj R)( r )] 

+ iVlog[cosh(A)]p < xp( _ ai j ) (m') - log(n) 

+ ac(p,^)-ac« xp( _ QK) ,7r i ). 

Thus for any £ > 0, putting /3 — jf^^x) > w ^h P probability at least 1 — 77, 

[$y(M')] 






exp(— a.R) 



log 
exp 



0% log[cosh(A)] 7r ; xp( _ /3r) ( m ') + ^ m ' 



< l0 g{<xp(-/3r) eX p{/ 3 f 1 °g[ COSh (^)]< X p(-^)( m ')} 

X Tlxp(-|8r){ ex p[/ 3 T Io e[ COsh (^)] 7r i«p(-/»r)( m/ ) + £ m ' } 



< 21 0g^<xp(-/3r) 



cxp{[e + /3f log[cosh(A)] 



A 7' 

7r exp(-/3r)( m ')} 



( 2 f ' 4) 



< 21 °g| 7r cxp(-/3r) CX P {[e+ w]<xp(-/3r)( m ')} } 



A 7 ' 



loe 



Taking £ = ^ , we get with P probability at least 1 — 77 
/3A 



47V I" 



oxp[-/3f tanh(A)fl] 



\ ®2 r 

/ L W 



< l0 g|<xp(-^) ex p{^<xp(-,3r)(™')} J 



2/3 (3X 



A 2iV 7 ' 



2^) l0g (l) 



Putting 



iV 2 

A = — lo g[ cosh (w)] 



and T(7) 



def 7tanh{f log[cosh(#)]} ^ 
7Vlog[cosh(-^)] 7^0 



1, 



this can be rewritten as 
47 



log[cosh(^)] (< xp( 



-/3T( 7 )JJ) 



< l°g{<xp(-/Jr) exp{/3f log[cosh(^)]< xp( _ /3r) (m / )} 



2/3 7 /?JVlog[cosh(£)] 



TV 2 log[cosh( 



+ 



2 77 ' 



log! - 



It is now tempting to simplify the picture a little bit by setting $\gamma' = \gamma$, leading to

Proposition 2.2.8. With $\mathbb{P}$ probability at least $1 - \eta$, for any $i \in \mathbb{N}$, any $\beta \in \mathbb{R}_+$,



del' 



_ | E log[cosh(f )] (jri M _ mim] 
N* log[cosh(f )] 



c- 



Y/2 



iVlog[cosh(^)l \ p -i / \ I 
^ 2C/3 J l0g[2 

- (C + l){log[K/%«] + 2- 1 log(3- 1 e)} 



where $\mathcal{C}\bigl[\pi^i_{\exp(-\beta r)}\bigr]$ is defined by equation (2.18, page 75).

We are now going to analyse Theorem 2.2.4 (page 73). For this, we will also need
an upper bound for $S_\lambda(\rho, \rho')$, defined by equation (2.13, page 71), using $M'$ and
empirical complexities, because of the special relations between empirical complexities
induced by the selection algorithm. To this purpose, a useful alternative to
Proposition 2.2.6 (page 76) is to write, with $\mathbb{P}$ probability at least $1 - \eta$,

7P ® p'(m') < 7p®p'[$_^(M')] 

+ ^[P,<xp(-Ai i )] +K[p',*4 p( _ vfl) ] -logfa), 

and thus at least with P probability 1 — 3r/, 
7P (g) p'(m') < 7p ® p' [$_^ (M')] 

+ (l-r 1 )^ 1 (3C[p,< 



l0g|7T, 



exp( — /3r) 



to 1 cxp(— p V) 



cxp(— pY) J 

exp{f log[cosh(f )}p{m')]\ } - C 1 Iog(»?) 
exp{f log[cosh($)]p(m')}] } - C 1 logfa) 



log(ry). 



When p = 7r* xp( _ /3r) and p' = 7r^ xp , ? , ,, we get with P probability at least 1 - 77, 
for any (3, f3', 7 G R+, any £, j e N, 

7P <g> p'(m') < 7p ® p' 7 [(M')] 



e(p) + e(p') 



C + i 
C-i 



log [3 V(7)r7] 



Proposition 2.2.9. FFzi/i P probability at least 1 — 77, /or any p = TT^p^^ji arl 2/ 



p 1 = TV 1 , n , s e ?, 

r cxp( — p'r) 
Sx( P ,p') < ^log[cosh(A)] p ® p '[$_ W A f')]



l + ^log[cosh(4)l r 



A 

(C + i) 



(C-l)A 



logfi-^We] + f log[cosh(£)] log [3-^(7)77] 



In order to analyse Theorem 2.2.4 (page 73), we need to index $\mathcal{P} = \{\rho_1, \dots, \rho_M\}$
in order of increasing empirical complexity $\mathcal{C}(\rho)$. To deal in a convenient way with
this indexation, we will write G(i,0) as C[ 7r ox P (-/3r)] ' as S[fex P (-/3r)] ' and 

With P probability at least 1 — e, when i < j < fc, as we already saw, 
< Pi(-R) < + A inf + Sa [25 a ( P „ a )] , 

where i — t(j) < t. Therefore, with P probability at least 1 — e — 77, 
Pi(R) < PAR) + A g 2 , J 2^ log[cosh(A)] Pj ® , ( M ')] 



(C + i) 



4 l + f lo g [cosh(A)] e ^ 
A 

K _ .j jA l log[3-V(A)e] + ^l g[cosh(A)] log [3 -^(7)^] 

We can now remark that

\[ \Xi_a(p + q) \le \Xi_a(p) + \Xi_a'(p) q \le \Xi_a(p) + \Xi_a'(0) q = \Xi_a(p) + \frac{a}{\tanh(a)} q \]

and that

\[ \Phi_{-a}(p + q) \le \Phi_{-a}(p) + \Phi_{-a}'(0) q = \Phi_{-a}(p) + \frac{\exp(a) - 1}{a} q. \]
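
Both remarks are tangent-line bounds for concave functions, and a quick numerical check may help to see where the constants come from. The sketch assumes $\Xi_a(q) = \tanh(a)^{-1}[1 - \exp(-aq)]$ for the function of equation (2.11), and $\Phi_{-a}(p) = a^{-1}\log\{1 + [\exp(a) - 1]p\}$ for the function of equation (1.1); neither definition is restated in this section, and the function names are hypothetical.

```python
import math


def xi(a: float, q: float) -> float:
    """Xi_a(q) = (1 - exp(-a q)) / tanh(a) (assumed form of equation (2.11))."""
    return (1.0 - math.exp(-a * q)) / math.tanh(a)


def phi_neg(a: float, p: float) -> float:
    """Phi_{-a}(p) = log(1 + (exp(a) - 1) p) / a (assumed form of equation (1.1))."""
    return math.log(1.0 + (math.exp(a) - 1.0) * p) / a


def tangent_bounds_hold(a: float, p: float, q: float, tol: float = 1e-12) -> bool:
    """Check both tangent bounds at 0 for non-negative p and q."""
    xi_ok = xi(a, p + q) <= xi(a, p) + a / math.tanh(a) * q + tol
    phi_ok = phi_neg(a, p + q) <= phi_neg(a, p) + (math.exp(a) - 1.0) / a * q + tol
    return xi_ok and phi_ok
```

Under these assumptions the slopes at the origin are $a / \tanh(a)$ and $[\exp(a) - 1]/a$, whose product with $\frac{N}{\lambda}\log[\cosh(\frac{\lambda}{N})]$ gives factors of the form $N[\exp(\frac{\gamma}{N}) - 1]\log[\cosh(\frac{\lambda}{N})] / [\gamma \tanh(\frac{\lambda}{N})]$.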



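Moreover, assuming as usual without substantial loss of generality that there exists
$\widetilde{\theta} \in \arg\min_{\Theta} R$, we can split $M'(\theta, \theta') \le M'(\theta, \widetilde{\theta}) + M'(\widetilde{\theta}, \theta')$. Let us then consider
the expected margin function defined by

\[ \varphi(y) = \sup_{\theta \in \Theta} M'(\theta, \widetilde{\theta}) - y R'(\theta, \widetilde{\theta}), \qquad y \in \mathbb{R}_+, \]

and let us write for any $y \in \mathbb{R}_+$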



Pj <8 Pl [<&_ a (M')] < ft ® p,{<f_^ [M'(., 9) + yR'(., 9) + <p(yj\ } 

< PJ {* -A [M'(., g) + <p(y)] } + ^M^r)- 1 ] [pi(jR) _ 



and 



^ _ 2,iV[ex P (^)^l] log[c sh(A) ] A _ 
7 tanh(A) / 
< [ Pj (R)-R(9)] +S,|^log[cosh(A)] )0j .{$_^[M'(,e) + ^(y)]}



1 + ^ log [cosh (4)1 

A 



2(C+1) 



log[3-V(A)e] + f log[cosh(A)] l og [3- 1 i/( 7 )» 7 ] 



(C-l)A 

With P probability at least 1 — e— ry, for any A, j,x,y £ K+, any j € {i, . . . , fc — l}, 
PZ(R)-R(0) <Pi(R)-R(0) 



< / 2yiV[exp( 1 I)-l]log[cosh(A)] 
" I 7tanh(A) 



^ | 2xjV[exp(^)-l]log[cosh(A)] 
7 tanh(A) 



[ Pj -(#)-i?(0)] 



S Jj log[cosh( A)] ^(s) + ^)] 



2(C + 1) 
(C-l)A 



+ 4 l + f log[cosh(A) ] _^ ) 
A 

{log[3-V(A) £ ] + Ai og [ cosh ( a.)] log[3-V 



Now we have to get an upper bound for $\rho_j(R)$. We can write $\rho_j = \pi^{\ell}_{\exp(-\beta' r)}$, as we
assumed that all the posterior distributions in $\mathcal{P}$ are of this special form. Moreover,
we already know from Theorem 2.1.8 (page 58) that with $\mathbb{P}$ probability at least
$1 - \eta$,
[iVsinh(^) - fl'C 1 } [4 M -0>r)(R) ~ <^ { -p>X-R)t R )] 

<C t (pC\P)-y>zW)^]> 
This proves that with P probability at least 1 — e — 2?7, 

p%{R) < R{0) 

i 2yN[exp(%) - l] log [cosh (A 
7tanh(A) 

i | 2xiV[exp(-I)-l]log[cosh(A)] 
7 tanh(A) 

»< k< <-.,„«> - m + tflr w-^ | 

y ox P ( c /»*)\ v iVsinh(A) y 

+ S a | ^ log[cosh( A)] + p( y )] 

1 + Al og r C osh(A)l_ 
2(C + 1)

(C- 



-i{log[3-V(A)e] + ^l g[cosh(A)] log [3-^(7)77] } 



The case when $j \in \{\widehat{k}+1, \dots, M\} \setminus (\arg\max t)$ is dealt with exactly in the same
way, with $i = t(j)$ replaced directly with $\widehat{k}$ itself, leading to the same inequality.

The case when $j \in (\arg\max t)$ is dealt with by bounding first $\rho_{\widehat{k}}(R) - R(\widetilde{\theta})$ in terms
of $\rho_{\widehat{t}}(R) - R(\widetilde{\theta})$, and this latter in terms of $\rho_j(R) - R(\widetilde{\theta})$. Let us put



(2.20) 



A(A, 7 ) 
S(A, 7 ) 



( 2x7V[exp(^) - 1] log[cosh(A)] 
^ 7tanh(A) 

i | 2yJV[exp(^)-l]log[cosh(A)] 
7 tanh(A) 



D(X,j )Pj ) = Sa |^l g[cosh(A)]$_, [<p(x) + <p(y)] 



N 



log[cosh(A)] 



e(Pi) 



2 ( C + 1 ){log[3-MA) £ ] 

log[cosh(A)] log[3- 1 I v( 7 )r ? ]}|, 



(C-l)A 

N 
H — 

7 



where $\overline{\mathcal{C}}(\rho_j) = \overline{\mathcal{C}}(\ell, \beta')$ is defined, when $\rho_j = \pi^{\ell}_{\exp(-\beta' r)}$, by equation (2.19, page
77). We obtain, still with $\mathbb{P}$ probability $1 - \epsilon - 2\eta$,

^)-R(0)<^4[^R)-Rm ■ 



A(A, 7 ) Lf '' 
B(A,7) 



4(A,7) 



A(A, 7 ) ' 



The use of the factor $D(\lambda, \gamma, \rho_j)$ in the first of these two inequalities, instead of
$D(\lambda, \gamma, \rho_{\widehat{t}})$, is justified by the fact that $\mathcal{C}(\rho_{\widehat{t}}) \le \mathcal{C}(\rho_j)$. Combining the two we get



P^R)<R(9) + ^^[ Pj {R)-R( 



A(A, 7 ) 



Since it is the worst bound of all cases, it holds for any value of j, proving 



Theorem 2.2.10. With $\mathbb{P}$ probability at least $1 - \epsilon - 2\eta$,



Pt(R)<R(6] 



inf <^ 



i?(A,7) 2 
^(A, 7 ) 2 



T exp(-/3r) 0^) ~ -^W 



" £(A )7 ) 

>(A,7) 



£»(A,7,7r, 



exp(— /3r) <* 



A(A, 7 ) 
< R(0)



inf { 



+ 1 



where the notation $A(\lambda, \gamma)$, $B(\lambda, \gamma)$ and $D(\lambda, \gamma, \rho)$ is defined by equation (2.20)
and where the notation $C^i(\beta, \gamma)$ is defined in Proposition 2.2.5 (page 76).

The bound is a little involved, but as we will prove next, it gives the same rate
as Theorem 2.1.15 (page 55) and its corollaries, when we work with a single model
(meaning that the support of $\mu$ is reduced to one point) and the goal is to choose
adaptively the temperature of the Gibbs posterior, except for the appearance of the
union bound factor $-\log[\nu(\beta)]$ which can be made of order $\log[\log(N)]$ without
spoiling the order of magnitude of the bound.

We will encompass the case when one must choose between possibly several
parametric models. Let us assume that each $\pi^i$ is supported by some measurable
parameter subset $\Theta_i$ (meaning that $\pi^i(\Theta_i) = 1$), let us also assume that the
behaviour of $\pi^i$ is parametric in the sense that there exists a dimension $d_i \in \mathbb{R}_+$
such that

\[ \sup_{\beta \in \mathbb{R}_+} \beta \bigl[ \pi^i_{\exp(-\beta R)}(R) - \inf_{\Theta_i} R \bigr] \le d_i. \tag{2.21} \]

Then


C 4 (A, 7 ) <log|< xp( _ Afl) [exp{2iVsinh(^) 2 Af'(.,0)} 



+ 2Nsmh(^)\l xp{ ^ R) [M'(.,6)} 



< l0g\ 7T 



^xTVsinhf^) 2 ^ 



exp 2xN sinh( ^) 2 [R - R(6)] j 

-2xN S mh(£_)\l xp{ _ XR) [R-R 
4Nsmh(^) 2 ip(x) 

xp{-[A— 2xN sinh( 27; 
2n) ^cxpf-Afl) [R — R 



T oxp(-Afl) 



< 2xiYsinh(2^) 2 < xp{ _ [A _ 2:l;Arsinh( _^ )2]J?} [R - R 



Thus 



C"(A,7) < 4Ysinh(^) ix [inf R-R(9)] + <p(a 



xdi 



xdj 



2A 2A-4xiVsinh(2^ 



In the same way, 
e(i,(3)< |^sinh(^)



[inf R-R(6j] +ip(x) 



e. 
Cxdi 



1 + 



2^Vsinh(if ) V 1 - x(t<mh(§£) 



+ 2N 



'^( 2^-1) ) 



- 1 

X(d; 



ip(x) + x [inf R-R{0)] 



ATsinh(f)- a; C7V[exp(^f rTy )-l] 

(C + i) 



(C-i) 



21og[K/?)M*)] +log(f) 



In order to keep the right order of magnitude while simplifying the bound, let us 
consider 



( 2 ' 

(2.22) C^max (-!,(£-) sin h(%p)' 



2jV 2 (C-l) 



eX P\2AT 2 (C- X l) J 



Then, for any /3 e (0,/3 max ), 
3dC 2 /3 2 



e(i,/3) < inf 



Thus 



+ 4- 



1 + 



/3r). 



27V7 



< 



y[w£R-R(e)]+<p(y) 

(C + i) 
(C-i) 

A fA[exp(#)-l] 



TVtanh(^) [ 7 
^^^[infi?-^)] + ^) + 



^L 1 ~~ 2(C-l)iV 

21og[i/C9)/i(i)] +log(f) 

[v?(x) + <p(y)] 

zdi \ 



^L 1 2(c-i)ivJ 



(C + i) 
(C-i) 



21og[i/C9)/i(i)] +log(2; 



2(C+1) 
(C-l)A 



log[3- 1 KA)e]+^log[3- 1 K7)r?] 



If we are not seeking tight constants, we can take for the sake of simplicity
$\lambda = \gamma = \beta$, $x = y$ and $\zeta = 2$.
Let us put



(2.23) C 2 = max C x 



jV[exp(V)-l] 



2iVlog[cosh(%p)] 
^ max tanh(£^p) 



7Vtanh(^ 



N 
so that



W)<i + ^, 



.r 



12Cl/?2 'z[infi?-i?(0)] +^(z) + 



-61og[K/?)MW] -31og(|) 



2zCh£] 
N J 



6C 2 

/3 



log[3-V(/3)e] + Ai og [ 3 -V (/3)?? ] 



and 



2xd; 



This leads to 



^(i?)<i? W+ inf^i-^ 



N 



2 ^+MR-R(9) 



+ 







Ci/3 2 
TV 



x[infi?-i?(6»] + <^(z) 
\og[v(P) fi(i)n] 



2xd, 



(4 + f)f 



^^^[infi?-i?(0)]+^) 



61og[K/3)MW] -31og(|) 



6C 2 

/3 



log[3-^(/3)e] +^-l g[3- 1 i,(/3) ?7 ] 



We see in this expression that, in order to balance the various factors depending 
on x it is advisable to choose x such that 



vaiR-MO) 



ip(x) 



as long as x < 
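Following Mammen and Tsybakov, let us assume that the usual margin assumption
holds: for some real constants $c > 0$ and $\kappa \ge 1$,

\[ R(\theta) - R(\widetilde{\theta}) \ge c \bigl[ D(\theta, \widetilde{\theta}) \bigr]^{\kappa}, \qquad \theta \in \Theta. \]

As $D(\theta, \widetilde{\theta}) \ge M'(\theta, \widetilde{\theta})$, this also implies the weaker assumption

\[ R(\theta) - R(\widetilde{\theta}) \ge c \bigl[ M'(\theta, \widetilde{\theta}) \bigr]^{\kappa}, \qquad \theta \in \Theta, \]

which we will really need and use. Let us take $\beta_{\max} = N$ and

\[ \nu = \frac{1}{\lceil \log_2(N) \rceil} \sum_{k=1}^{\lceil \log_2(N) \rceil} \delta_{2^k}. \]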
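
To make the role of this prior on the inverse temperature concrete, here is a minimal sketch of the dyadic grid and of the union bound price $-\log[\nu(\beta)]$ it induces. It assumes the reading of $\nu$ as the uniform distribution on $\{2^k : 1 \le k \le \lceil \log_2(N) \rceil\}$; the function names are hypothetical.

```python
import math
from typing import List, Tuple


def temperature_grid(n: int) -> Tuple[List[int], float]:
    """Return the support {2^k} of nu and the common weight of each atom (n >= 2)."""
    k_max = math.ceil(math.log2(n))
    betas = [2 ** k for k in range(1, k_max + 1)]
    return betas, 1.0 / k_max


def union_bound_cost(n: int) -> float:
    """-log nu(beta), identical for every beta in the support of nu."""
    _, weight = temperature_grid(n)
    return -math.log(weight)
```

Since $\lceil \log_2(N) \rceil \le 1 + \log_2(N)$, the cost is at most $\log[1 + \log_2(N)]$, which is the doubly logarithmic term appearing in the bounds that follow.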



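Then, as we have already seen, $\varphi(x) \le (1 - \kappa^{-1})(\kappa c x)^{-\frac{1}{\kappa - 1}}$. Thus $\varphi(x)/x \le b x^{-\frac{\kappa}{\kappa - 1}}$,
where $b = (1 - \kappa^{-1})(\kappa c)^{-\frac{1}{\kappa - 1}}$. Let us choose accordingly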
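
The bound on the expected margin function can be recovered by a one-dimensional maximisation: under the weaker margin assumption, $\varphi(x) \le \sup_{m \in [0,1]} (m - x c m^{\kappa})$, and the unconstrained maximum is reached at $m = (\kappa c x)^{-1/(\kappa - 1)}$. The sketch below compares a grid search with the closed form. It assumes $\kappa > 1$ and that $M'$ takes values in $[0, 1]$; the function names and the grid size are hypothetical.

```python
def margin_sup(x: float, c: float, kappa: float, grid: int = 20001) -> float:
    """Grid approximation of sup over m in [0, 1] of m - x * c * m**kappa."""
    step = 1.0 / (grid - 1)
    return max(i * step - x * c * (i * step) ** kappa for i in range(grid))


def margin_bound(x: float, c: float, kappa: float) -> float:
    """Closed form (1 - 1/kappa) * (kappa * c * x) ** (-1 / (kappa - 1)), kappa > 1."""
    return (1.0 - 1.0 / kappa) * (kappa * c * x) ** (-1.0 / (kappa - 1.0))
```

When the unconstrained maximiser falls outside $[0, 1]$ (small $x$), the grid value is strictly below the closed form, so the closed form remains a valid upper bound.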



x = mm 



. f def (m{ @i R-R(6)\ 

T = { 1 J 



,x 2 



def 



N 



Using the fact that when $r \in (0, \frac{1}{2})$, $\bigl(\frac{1+r}{1-r}\bigr)^2 \le 1 + 16 r \le 9$, we get with $\mathbb{P}$ probability
at least $1 - \epsilon$, for any $\beta \in \operatorname{supp} \nu$, in the case when $x = x_1 \le x_2$,



p?(R) < inf R + 538 Cj^-b^ \wiR-R 

k Si N &i 

C 2 



+ 



138di + 1661og[l + log 2 (A0] - 134 log [fj,(i)] - 102 log(e) + 724 



and in the case when x — x 2 < x\, 



p~(R) < inf R + 68d[M R - R(9)] + 269C 2 2 -^(x) 



k- ' — e 



138di + 1661og[l + log 2 (iV)] - 134 log [/z(i)] - 1021og(e) + 724 



< infi? + 541Cf4</?0) 



138rfi + 1661og[l + log 2 (iV)] -1341og[^(i)] - 102 log(e) + 724 
Thus with P probability at least 1 — e, 

p~(R) < inf R + mf 1082 Cf max] 6^ [inf i? - R(9)] » , 



/3G(l,iV) 



C2 

/3 



,/4C 2 /3 

138 di + 166 Iog[l + log 2 (iV)] 

- 1341og[^(i)] - 1021og(e) + 724 






Theorem 2.2.11. With probability at least $1 - \epsilon$, for any $i \in \mathbb{N}$,



p?(R)<MR 

6; 



+ max < 



847CJ 



\ 



[infe, R ~ R(6)] * [di + bg(i±^p) + 5} 
N ' 



2C* 2 [1082 b] ' 



166C 2 



4.2K-1 



where $C_2$, given by equation (2.23), will in most cases be close to 1, and in
any case less than 3.2.



This result gives a bound of the same form as that given in Theorem 2.1.15
in the special case when there is only one model, that is when $\mu$ is a Dirac
mass, for instance $\mu(1) = 1$, implying that $\inf_{\Theta_1} R - R(\widetilde{\theta}) = 0$. Moreover the parametric
complexity assumption we made for this theorem, given by equation (2.21, page 85),
is weaker than the one used in Theorem 2.1.15 and described by equation (2.8).
When there is more than one model, the bound shows that the estimator makes
a trade-off between model accuracy, represented by $\inf_{\Theta_i} R - R(\widetilde{\theta})$, and dimension,
represented by $d_i$, and that for optimal parametric sub-models, meaning those for
which $\inf_{\Theta_i} R = \inf_{\Theta} R$, the estimator does at least as well as the minimax optimal
convergence speed in the best of these.

Another point is that we obtain more explicit constants than in Theorem 2.1.15.
It is also clear that a more careful choice of parameters could have brought some
improvement in the value of these constants.

These results show that the selection scheme described in this section is a good
candidate to perform temperature selection of a Gibbs posterior distribution built
within a single parametric model in a rate optimal way, as well as a proposal with
proven performance bound for model selection.



2.3. Two step localization 

2.3.1. Two step localization of bounds relative to a Gibbs prior 

Let us reconsider the case where we want to choose adaptively among a family of
parametric models. Let us thus assume that the parameter set is a disjoint union
of measurable sub-models, so that we can write $\Theta = \sqcup_{m \in M} \Theta_m$, where $M$ is some
measurable index set. Let us choose some prior probability distribution on the
index set $\mu \in \mathcal{M}_+^1(M)$, and some regular conditional prior distribution $\pi : M \to
\mathcal{M}_+^1(\Theta)$, such that $\pi(i, \Theta_i) = 1$, $i \in M$. Let us then study some arbitrary posterior
distributions $\nu : \Omega \to \mathcal{M}_+^1(M)$ and $\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$, such that $\rho(\omega, i, \Theta_i) = 1$,
$\omega \in \Omega$, $i \in M$. We would like to compare $\nu\rho(R)$ with some doubly localized prior
distribution $\mu_{\exp[-\frac{\beta}{1+\zeta_2} \pi_{\exp(-\beta R)}(R)]} \bigl[\pi_{\exp(-\beta R)}\bigr](R)$ (where $\zeta_2$ is a positive parameter
to be set as needed later on). To ease notation we will define two prior distributions
(one being more precisely a conditional distribution) depending on the positive real
parameters $\beta$ and $\zeta_2$, putting

\[ \overline{\pi} = \pi_{\exp(-\beta R)} \quad \text{and} \quad \overline{\mu} = \mu_{\exp[-\frac{\beta}{1+\zeta_2} \overline{\pi}(R)]}. \tag{2.24} \]



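Similarly to Theorem 1.4.3 on page 37 we can write for any positive real constants
$\beta$ and $\gamma$

\[ \mathbb{P} \Bigl\{ (\overline{\mu}\,\overline{\pi}) \otimes (\overline{\mu}\,\overline{\pi}) \Bigl[ \exp \Bigl( -N \log\bigl[1 - \tanh(\tfrac{\gamma}{N}) R'\bigr]
- \gamma r' - N \log\bigl[\cosh(\tfrac{\gamma}{N})\bigr] m' \Bigr) \Bigr] \Bigr\} \le 1, \]

and deduce, using Lemma 1.1.3, that

\[ \mathbb{P} \Bigl\{ \exp \Bigl[ \sup_{\nu} \sup_{\rho} \Bigl( -N \log\bigl[1 - \tanh(\tfrac{\gamma}{N}) (\nu\rho - \overline{\mu}\,\overline{\pi})(R)\bigr]
- \gamma (\nu\rho - \overline{\mu}\,\overline{\pi})(r) - N \log\bigl[\cosh(\tfrac{\gamma}{N})\bigr] (\nu\rho) \otimes (\overline{\mu}\,\overline{\pi})(m')
- \mathcal{K}(\nu, \overline{\mu}) - \nu\bigl[\mathcal{K}(\rho, \overline{\pi})\bigr] \Bigr) \Bigr] \Bigr\} \le 1. \tag{2.25} \]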

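This will be our starting point in comparing $\nu\rho(R)$ with $\overline{\mu}\,\overline{\pi}(R)$. However, obtaining
an empirical bound will require some supplementary efforts. For each index of the
model index set $M$, we can write in the same way

\[ \mathbb{P} \Bigl\{ \overline{\pi} \otimes \overline{\pi} \Bigl[ \exp \Bigl( -N \log\bigl[1 - \tanh(\tfrac{\gamma}{N}) R'\bigr] - \gamma r' - N \log\bigl[\cosh(\tfrac{\gamma}{N})\bigr] m' \Bigr) \Bigr] \Bigr\} \le 1. \]

Integrating this inequality with respect to $\overline{\mu}$ and using Fubini's lemma for positive
functions, we get

\[ \mathbb{P} \Bigl\{ \overline{\mu}(\overline{\pi} \otimes \overline{\pi}) \Bigl[ \exp \Bigl( -N \log\bigl[1 - \tanh(\tfrac{\gamma}{N}) R'\bigr] - \gamma r' - N \log\bigl[\cosh(\tfrac{\gamma}{N})\bigr] m' \Bigr) \Bigr] \Bigr\} \le 1. \]

Note that $\overline{\mu}(\overline{\pi} \otimes \overline{\pi})$ is a probability measure on $M \times \Theta \times \Theta$, whereas $(\overline{\mu}\,\overline{\pi}) \otimes (\overline{\mu}\,\overline{\pi})$
considered previously is a probability measure on $(M \times \Theta) \times (M \times \Theta)$. We get as
previously

\[ \mathbb{P} \Bigl\{ \exp \Bigl[ \sup_{\nu} \sup_{\rho} \Bigl( -N \log\bigl[1 - \tanh(\tfrac{\gamma}{N}) \nu(\rho - \overline{\pi})(R)\bigr]
- \gamma \nu(\rho - \overline{\pi})(r) - N \log\bigl[\cosh(\tfrac{\gamma}{N})\bigr] \nu(\rho \otimes \overline{\pi})(m')
- \mathcal{K}(\nu, \overline{\mu}) - \nu\bigl[\mathcal{K}(\rho, \overline{\pi})\bigr] \Bigr) \Bigr] \Bigr\} \le 1. \tag{2.26} \]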



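Let us finally recall that

\[ \mathcal{K}(\nu, \overline{\mu}) = \frac{\beta}{1 + \zeta_2} (\nu - \overline{\mu}) \overline{\pi}(R) + \mathcal{K}(\nu, \mu) - \mathcal{K}(\overline{\mu}, \mu), \tag{2.27} \]

\[ \mathcal{K}(\rho, \overline{\pi}) = \beta (\rho - \overline{\pi})(R) + \mathcal{K}(\rho, \pi) - \mathcal{K}(\overline{\pi}, \pi). \tag{2.28} \]

From equations (2.25), (2.26) and (2.28) we deduce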
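Proposition 2.3.1. For any positive real constants $\beta$, $\gamma$ and $\zeta_2$, with $\mathbb{P}$ probability
at least $1 - \epsilon$, for any posterior distribution $\nu : \Omega \to \mathcal{M}_+^1(M)$ and any conditional
posterior distribution $\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,

\[ -N \log\bigl[1 - \tanh(\tfrac{\gamma}{N}) (\nu\rho - \overline{\mu}\,\overline{\pi})(R)\bigr] - \beta \nu(\rho - \overline{\pi})(R)
\le \gamma (\nu\rho - \overline{\mu}\,\overline{\pi})(r) + N \log\bigl[\cosh(\tfrac{\gamma}{N})\bigr] (\nu\rho) \otimes (\overline{\mu}\,\overline{\pi})(m')
+ \mathcal{K}(\nu, \overline{\mu}) + \nu\bigl[\mathcal{K}(\rho, \pi)\bigr] - \nu\bigl[\mathcal{K}(\overline{\pi}, \pi)\bigr] + \log(\tfrac{2}{\epsilon}), \]

and

\[ -N \log\bigl[1 - \tanh(\tfrac{\gamma}{N}) \nu(\rho - \overline{\pi})(R)\bigr]
\le \gamma \nu(\rho - \overline{\pi})(r) + N \log\bigl[\cosh(\tfrac{\gamma}{N})\bigr] \nu(\rho \otimes \overline{\pi})(m')
+ \mathcal{K}(\nu, \overline{\mu}) + \nu\bigl[\mathcal{K}(\rho, \overline{\pi})\bigr] + \log(\tfrac{2}{\epsilon}), \]

where the prior distribution $\overline{\mu}\,\overline{\pi}$ is defined by equation (2.24) and depends
on $\beta$ and $\zeta_2$.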

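Let us put for short

\[ T = \tanh(\tfrac{\gamma}{N}) \quad \text{and} \quad C = N \log\bigl[\cosh(\tfrac{\gamma}{N})\bigr]. \]

We will use an entropy compensation strategy for which we need a couple of
entropy bounds. We have according to Proposition 2.3.1, with $\mathbb{P}$ probability at
least $1 - \epsilon$,

\[ \nu\bigl[\mathcal{K}(\rho, \overline{\pi})\bigr] = \beta \nu(\rho - \overline{\pi})(R) + \nu\bigl[\mathcal{K}(\rho, \pi) - \mathcal{K}(\overline{\pi}, \pi)\bigr]
\le \frac{\beta}{NT} \Bigl\{ \gamma \nu(\rho - \overline{\pi})(r) + C \nu(\rho \otimes \overline{\pi})(m')
+ \mathcal{K}(\nu, \overline{\mu}) + \nu\bigl[\mathcal{K}(\rho, \overline{\pi})\bigr] + \log(\tfrac{2}{\epsilon}) \Bigr\}
+ \nu\bigl[\mathcal{K}(\rho, \pi) - \mathcal{K}(\overline{\pi}, \pi)\bigr]. \]

Similarly

\[ \mathcal{K}(\nu, \overline{\mu}) = \frac{\beta}{1 + \zeta_2} (\nu - \overline{\mu}) \overline{\pi}(R) + \mathcal{K}(\nu, \mu) - \mathcal{K}(\overline{\mu}, \mu)
\le \frac{\beta}{(1 + \zeta_2) NT} \Bigl\{ \gamma (\nu - \overline{\mu}) \overline{\pi}(r) + C (\nu \overline{\pi}) \otimes (\overline{\mu}\,\overline{\pi})(m')
+ \mathcal{K}(\nu, \overline{\mu}) + \log(\tfrac{2}{\epsilon}) \Bigr\}
+ \mathcal{K}(\nu, \mu) - \mathcal{K}(\overline{\mu}, \mu). \]

Thus, for any positive real constants $\beta$, $\gamma$ and $\zeta_i$, $i = 1, \dots, 5$, with $\mathbb{P}$ probability
at least $1 - \epsilon$, for any posterior distributions $\nu, \nu_3 : \Omega \to \mathcal{M}_+^1(M)$, any posterior
conditional distributions $\rho, \rho_1, \rho_2, \rho_4, \rho_5 : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,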

- JV log [1 - T(i/p - /!¥) (#)] - 0v(p - 7f) (i?) 

< ^{vp — JIW)(r) + C(yp) ® (jlTf)(m') 

+ X(v,]l) + v[X(p,ir)-X{ir,iT)] +log(f), 
NT

p[X(p u W)] < Ci77*(pi -7f)(r) + CiCp(pi ®7F)(m') 

+ Cl7*[3C(pi,7f)] + Cl l0g(f ) + Cl— M[3C(^l,7r) - DC(7f,7T) 

iVT 

C 2 — i/[3C(p2,7F)] < C27KP2 -7f)(r) +C2C^(p 2 ®7f)(m') 

+ C 2 3C(^P) + C2^[3C(P2,7f)] +C 2 log(f) 



7VT 

+ C 2 ^- i/[3C(p2,7r)-DC(7f,7r)], 



7VT 

Cs(l + C2)— 3C(^ 3 ,M) < C37(^3 -7*)7f(r) 



+ C3C[(^37f) ® (KSPl) + ® (p7f)] (m') + C3^(^3, M) + Cs Iog(f ) 



AT 

+ Cs(l + C2)— [3C(K3, M) - 3C(/I, m)] : 



AT 

C4 — ^[^(P4,7f)] < C47^3(P4 -7f)(r) 



+ C4Cj/ 3 (/04 ® 7F)(m') + C4^(^3, M) + C4^3 [3C(P4,7F)] + C4 log(f ) 

+ C4-jg-^[3C(p4,7r)-DC(7f,7T)], 

AT 

C5— MpC(P5,7r)] < C57M(P5 - 7f)W + (5C~p(p 5 ® 7f)(m') 

+ C5M[3C(P5,^)] +C 5 l0g(|) + C5 — M[3C(P5,^) - DC(7f,7T)]. 

Adding these six inequalities and assuming that 
(2-29) C4 < C3 [(i + C2 )^r_i] ; 

we find 

- TV log [1 - T(vp - TIW)(R)] - P{v P - -pT)(R) 

< -AT log [l -T(up--pT)(R)] - 0(vp-Jilf)(R) 

+ Ci(^ - l)p[X(pi,W)] + C 2 (^ - l)v[X{p2,T)] 

+ C 4 (^ - l)u 3 [X(p 4 ,W)] +( 5 (^f l)ji[X(p 5 ,T)] 

< ~/(vp-JIT)(r) +Ci7M(Pi -tt)W + C27KP2 -7r)(r) 

+ C37(^3 - ~p.)K{r) + Cil^{pi - T*){r) + C57M(P5 - 7r)(r) 

+ C [(i/p) <g> (Jl7f) + CiM(Pi O tt) + C2^(P2 <8> 7f) 

+ C3(^37f) ® (^3Pl) + C3(^3Pl) ® (M7f) 

+ C4^3(P4 ® 7F) + CsMl/OS ® 7F)] (™') 
+ (1 + C2) [3C(i/, n) - X(JI, n)] + V [X(p, 7T) - 5C(7T, 7T)] 

+ Ci [3C(pi , tt) - 3C(tt, tt)] + C2 [3C(P2 , tt) - 3C(tF, tt)] 
+ Cs(l + C2)^[3C(^3,M) - 3C(M,A*)] + C4^^[3C(P4,tt) - DC(7r,7r)] 

+ C 5 ^/I[3C(P5, - 3C(7f, tt)] + (1 + Ci + C2 + Cs + C 4 + Cs) log(f ), 
where we have also used the fact (concerning the 11th line of the preceding
inequalities) that

-0(vp--pT)(R)+X(v,Ji) + v[X(p,T)] 

< -0(vp - JI if)(R) + (i + C 2 )3C(^,M) + v[X(j>,*)] 

= (1 + C2) [X(u, p) - X(p, m)] + v [X{p, tt) - X(n, tt)] . 

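Let us now apply to $\overline{\pi}$ (we shall later do the same with $\overline{\mu}$) the following inequalities,
holding for any random functions of the sample and the parameters $h : \Omega \times \Theta \to \mathbb{R}$
and $g : \Omega \times \Theta \to \mathbb{R}$,

\[ \overline{\pi}(g - h) - \mathcal{K}(\overline{\pi}, \pi) \le \sup_{\rho} \, \rho(g - h) - \mathcal{K}(\rho, \pi)
= \log\bigl\{ \pi\bigl[\exp(g - h)\bigr] \bigr\}
= \log\bigl\{ \pi\bigl[\exp(-h)\bigr] \bigr\} + \log\bigl\{ \pi_{\exp(-h)}\bigl[\exp(g)\bigr] \bigr\}
= -\pi_{\exp(-h)}(h) - \mathcal{K}(\pi_{\exp(-h)}, \pi) + \log\bigl\{ \pi_{\exp(-h)}\bigl[\exp(g)\bigr] \bigr\}. \]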
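
This chain is the usual Legendre duality between the relative entropy and the log-Laplace transform, and it can be checked on a finite parameter set. The sketch below is only an illustration under that finite-support assumption; the vectors `prior`, `g`, `h` and all function names are hypothetical.

```python
import math
from typing import List, Tuple


def gibbs(prior: List[float], h: List[float]) -> List[float]:
    """Gibbs measure pi_{exp(-h)}: density proportional to exp(-h) w.r.t. prior."""
    weights = [p * math.exp(-hi) for p, hi in zip(prior, h)]
    total = sum(weights)
    return [w / total for w in weights]


def kl(rho: List[float], pi: List[float]) -> float:
    """Relative entropy K(rho, pi) on a finite set."""
    return sum(r * math.log(r / p) for r, p in zip(rho, pi) if r > 0)


def expect(rho: List[float], f: List[float]) -> float:
    """Integral of f with respect to rho."""
    return sum(r * fi for r, fi in zip(rho, f))


def identity_sides(prior: List[float], g: List[float], h: List[float]) -> Tuple[float, float]:
    """Return log pi[exp(g - h)] and the last expression of the chain."""
    lhs = math.log(sum(p * math.exp(gi - hi) for p, gi, hi in zip(prior, g, h)))
    pi_h = gibbs(prior, h)
    log_laplace = math.log(sum(p * math.exp(gi) for p, gi in zip(pi_h, g)))
    rhs = -expect(pi_h, h) - kl(pi_h, prior) + log_laplace
    return lhs, rhs
```

The two returned values coincide up to rounding, and $\rho(g - h) - \mathcal{K}(\rho, \pi)$ stays below them for any other probability vector $\rho$.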

When $h$ and $g$ are observable, and $h$ is not too far from $\beta r \simeq \beta R$, this gives a
way to replace $\overline{\pi}$ with a satisfactory empirical approximation. We will apply this
method, choosing $\rho_1$ and $\rho_5$ such that $\overline{\mu}\,\overline{\pi}$ is replaced either with $\overline{\mu}\rho_1$, when it comes
from the first two inequalities, or with $\overline{\mu}\rho_5$ otherwise, choosing $\rho_2$ such that $\nu\overline{\pi}$ is
replaced with $\nu\rho_2$ and $\rho_4$ such that $\nu_3\overline{\pi}$ is replaced with $\nu_3\rho_4$. We will do so because
it leads to a lot of helpful cancellations. For those to happen, we need to choose
$\rho_i = \pi_{\exp(-\lambda_i r)}$, $i = 1, 2, 4$, where $\lambda_1$, $\lambda_2$ and $\lambda_4$ are such that



(2.30) 
(2.31) 

(2.32) 

(2.33) 

and to assume that 
(2.34) 



(l + Ci)7 = Ci^A 1 , 

C27=(1 + C2^)A 2 , 
NT 

(U - Gh = U— a 4 , 

C37 = C5^A 5 , 



C4 > Ca 



We obtain that with F probability at least 1 — e, 

- N log [l - T(pp - It7f)(i?)] - 0(vp - TIW)(R) 

< j(vp - ppi){r) + Csli^Pi - PPb)(r) 



+ Ci^M log 



Pi i exp 



Cj^[i^p + CiPi](m') 



+ (l + C2^)^|log|p 2 |exp 



1+C2 



wC2P 2 (TO-') 



+ C^s log 



+ C5^ log 



Pi< exp 



P5 S CXp 



C ntc; [Cs^spi + CbPb] (m') 
+ (1 + C2) [X(y, m) - %(% n)] + v[X{p, tt) - X( P 2,tt)]
+ ( 3 (l + (2)^[X(v 3 ,p)-X(- Pl p)] 

+ ( 1 + £&) Ml)- 

^ '4 = 1 ' 

In order to obtain more cancellations while replacing $\overline{\mu}$ by some posterior distribution,
we will choose the constants such that $\lambda_5 = \lambda_4$, which can be done by
choosing



(2.35) 



Cs 



C3C4 



C4-C3 

We can now replace $\overline{\mu}$ with $\mu_{\exp[-\xi_1 \rho_1(r) - \xi_4 \rho_4(r)]}$, where



(2.36) 
(2.37) 



1 (i + fc)(i + ^6)' 



7C3 



(i + &)(i + TO 



Choosing moreover $\nu_3 = \mu_{\exp[-\xi_1 \rho_1(r) - \xi_4 \rho_4(r)]}$, to induce some more cancellations,
we get

Theorem 2.3.2. Let us use the notation introduced above. For any positive real
constants satisfying equations (2.29), (2.30), (2.31), (2.32), (2.33), (2.34), (2.35),
(2.36) and (2.37), with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution
$\nu : \Omega \to \mathcal{M}_+^1(M)$ and any conditional posterior distribution $\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,

\[ -N \log\bigl[1 - T (\nu\rho - \overline{\mu}\,\overline{\pi})(R)\bigr] - \beta (\nu\rho - \overline{\mu}\,\overline{\pi})(R) \le B(\nu, \rho, \beta), \]
where $B(\nu, \rho, \beta) \overset{\text{def}}{=} \gamma (\nu\rho - \nu_3\rho_1)(r)$



+ (i + C 2 )(i + ^C 3 ) 



x log< v 3 



Pi S exp 



;i»t 



x p4< exp 



C 



NTQ 



■ [C3^3Pi + C5/O4] (m ; ) 



3(l + C2)(l+ i ¥-C3) 



(l + Ca^Mlog^exp 



C4^3<!l0g 



|l0g|p2|cxp 1+ ^ M r C2P2(m) I l 
C NTU [&"3Pl + C4P4] ("0 I 



pi< exp 



(1 + C 2 )[3C(^M)-^3,M)] 

+ V [X(j>, 7T) - 3C(p 2 , 7T)] + ( 1 + £ Ci ) log( 



This theorem can be used to find the largest value (3{vp) of (3 such that B{v, p, 
(3) < 0, thus providing an estimator for (3(yp) defined as vp(R) = JIp( vp yW p( vp )(R), 
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where we have mentioned explicitly the dependence of p and 7f in (5, the constant 
(2 staying fixed. The posterior distribution vp may then be chosen to maximize 
(3{yp) within some manageable subset of posterior distributions 3 , thus gaining 
the assurance that vp{R) < Ji^, Jf-g, AR), with the largest parameter fi{vp) 

that this approach can provide. Maximizing (3{vp) is supported by the fact that 
lim^+oc JZpTp(R) = ess inf^ R. Anyhow, there is no assurance (to our knowledge) 
that [3 1— ► JIpTp (R) will be a decreasing function of /3 all the way, although this may 
be expected to be the case in many practical situations. 

We can make the bound more explicit in several ways. One point of view is to 
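put forward the optimal values of $\rho$ and $\nu$. We can thus remark that

$$\begin{aligned}
\nu\bigl[\gamma\rho(r) &+ \mathcal{K}(\rho, \pi) - \mathcal{K}(\rho_2, \pi)\bigr] + (1 + \zeta_2)\mathcal{K}(\nu, \mu) \\
&= \nu\Bigl\{\mathcal{K}[\rho, \pi_{\exp(-\gamma r)}] + \lambda_2\rho_2(r) + \int_{\lambda_2}^{\gamma} \pi_{\exp(-\alpha r)}(r)\,d\alpha\Bigr\} + (1 + \zeta_2)\mathcal{K}(\nu, \mu) \\
&= \nu\bigl\{\mathcal{K}[\rho, \pi_{\exp(-\gamma r)}]\bigr\}
 + (1 + \zeta_2)\mathcal{K}\Bigl[\nu, \mu_{\exp\bigl\{-\frac{1}{1+\zeta_2}\bigl[\lambda_2\rho_2(r) + \int_{\lambda_2}^{\gamma}\pi_{\exp(-\alpha r)}(r)\,d\alpha\bigr]\bigr\}}\Bigr] \\
&\qquad - (1 + \zeta_2)\log\Bigl\{\mu\Bigl[\exp\Bigl\{-\frac{1}{1+\zeta_2}\Bigl[\lambda_2\rho_2(r) + \int_{\lambda_2}^{\gamma}\pi_{\exp(-\alpha r)}(r)\,d\alpha\Bigr]\Bigr\}\Bigr]\Bigr\}.
\end{aligned}$$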



Thus 



B(v,p,f3) = (l + C2)[a^3pi(r)+^3/94(r) 

+ log{M[exp(-CiPi(r) - Upi(r))] } 
A 2 



7^3Pi(r) + (l + C2)(l + irC3) 



P2(r) 



1 f 7 



exp( — ar) 



(r)da 







(" 


pi 1 exp 



/3(l + C2)(l+- ! ^ i: <3) 



x p4< exp 



C lvTa i^3Pl + C5P4] (m') 



/3(l + C 2 )(l+-^ ; <3) 



+ (i + C 2 ^) 



'jlogjp 2 jexp 



TT fwC2P2(m') 



pi S exp 



C lvTcI [Cs^sPi + Cm] (m') 



+ v{X[p, 7T CX p(_ 7r) ] } 

+ (i + ^["'./•^^r . exp( _ ao( ^ 



+ (i+ECi)iog( 



i=l 



This formula is better understood when thinking about the following upper bound
for the first two lines in the expression of $B(\nu, \rho, \beta)$:
$(1 + \zeta_2)\bigl[\xi_1\nu_3\rho_1(r) + \xi_4\nu_3\rho_4(r) + \log\{\mu[\exp(-\xi_1\rho_1(r) - \xi_4\rho_4(r))]\}\bigr]$

A 2 



(l + C 2 )log{/i 



exp 



i + c 2 " P2(r) 

1 f7 



1 + C2 J\_ 

< v z 



^cx P (-ar)(r)da 



A2P2M+ / 7r CX p(-Qr)(f)rfa - lPi{r) 



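Another approach to understanding Theorem 2.3.2 is to put forward
$\rho_0 = \pi_{\exp(-\lambda_0 r)}$, for some positive real constant $\lambda_0 < \gamma$, noticing that

$$\nu\bigl[\mathcal{K}(\rho_0, \pi) - \mathcal{K}(\rho_2, \pi)\bigr] = \lambda_0\,\nu(\rho_2 - \rho_0)(r) - \nu\bigl[\mathcal{K}(\rho_2, \rho_0)\bigr].$$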



Thus 



B{v,pa,p) < ^[(7 ~ A )(p - Pi){r) + A (p 2 - Pi)(r)] 

+ (i + C 2 )(i + ^C 3 ) 



x log< v z 



Pi S exp 



ClNT 



x p4< exp 



/3(l + C2)(l+' ! ^-C3) 



3(l+C2)(l+ i ^<3) 



(l + C2 i 7 r)^^log^ 2 <iexp 



+a^ 3 log 



P4S exp 



C NTU i^SPl + C4P4] (TO') 



+ (1 + C 2 )3C 



^' (T- A 0)P0('') + A 0P2('-) 

eXp V PRa 



[3C(p 2! p )] + (l + ^C,)log 



i=i 

In the case when we want to select a single model $\widehat{m}(\omega)$, and therefore to set
$\nu = \delta_{\widehat{m}}$, the previous inequality engages us to take

$$\widehat{m} \in \arg\min_{m \in M}\; (\gamma - \lambda_0)\rho_0(m, r) + \lambda_0\rho_2(m, r).$$

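In parametric situations where

$$\pi_{\exp(-\lambda r)}(m, r) \simeq r^\star(m) + \frac{d_e(m)}{\lambda},$$

we get

$$(\gamma - \lambda_0)\rho_0(m, r) + \lambda_0\rho_2(m, r) \simeq \gamma\Bigl[r^\star(m) + d_e(m)\Bigl(\frac{\gamma - \lambda_0}{\gamma\lambda_0} + \frac{\lambda_0}{\gamma\lambda_2}\Bigr)\Bigr],$$

resulting in a linear penalization of the empirical dimension of the models.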
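
The linear penalization can be illustrated numerically. The following Python sketch is only an illustration under stated assumptions: it takes the parametric approximation $\pi_{\exp(-\lambda r)}(m, r) \simeq r^\star(m) + d_e(m)/\lambda$ for granted, it assumes that $\rho_0$ and $\rho_2$ are the Gibbs posteriors at inverse temperatures $\lambda_0$ and $\lambda_2$, and the model values and constants (`r_star`, `d_e`, `gamma`, `lam0`, `lam2`) are hypothetical.

```python
import numpy as np


def selection_criterion(r_star, d_e, gamma, lam0, lam2):
    """(gamma - lam0) * rho_0(m, r) + lam0 * rho_2(m, r) for each model m.

    Assumption: rho_0 and rho_2 are replaced by the parametric
    approximation r_star + d_e / lambda at lambda = lam0 and lam2.
    """
    r_star = np.asarray(r_star, dtype=float)
    d_e = np.asarray(d_e, dtype=float)
    rho0 = r_star + d_e / lam0
    rho2 = r_star + d_e / lam2
    return (gamma - lam0) * rho0 + lam0 * rho2


def penalty_slope(gamma, lam0, lam2):
    """Coefficient multiplying d_e(m) in the criterion above."""
    return (gamma - lam0) / lam0 + lam0 / lam2


# Hypothetical nested models: the error floor decreases, the dimension grows.
r_star = np.array([0.30, 0.22, 0.20, 0.195])
d_e = np.array([1.0, 3.0, 8.0, 20.0])
gamma, lam0, lam2 = 200.0, 100.0, 50.0

crit = selection_criterion(r_star, d_e, gamma, lam0, lam2)
m_hat = int(np.argmin(crit))
print(crit, m_hat)
```

The criterion is affine in the dimension, so a larger model is selected only when its gain in $r^\star$ outweighs the slope times the added dimension.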
2.3.2. Analysis of two step bounds relative to a Gibbs prior

We will not state a formal result, but will nevertheless give some hints about how
to establish one. This is a rather technical section, which can be skipped at a first
reading, since it will not be used below. We should start from Theorem 1.4.2 (page
36), which gives a deterministic variance term. From Theorem 1.4.2, after a change
of prior distribution, we obtain for any positive constants $\alpha_1$ and $\alpha_2$, any prior
distributions $\widetilde{\mu}_1$ and $\widetilde{\mu}_2 \in \mathcal{M}_+^1(M)$, for any prior conditional distributions $\widetilde{\pi}_1$ and
$\widetilde{\pi}_2 : M \to \mathcal{M}_+^1(\Theta)$, with $\mathbb{P}$ probability at least $1 - \eta$, for any posterior distributions
$\nu_1\rho_1$ and $\nu_2\rho_2$,

$$\begin{aligned}
\alpha_1(\nu_1\rho_1 - \nu_2\rho_2)(R) \le{}& \alpha_2(\nu_1\rho_1 - \nu_2\rho_2)(r)
 + \mathcal{K}\bigl[(\nu_1\rho_1) \otimes (\nu_2\rho_2), (\widetilde{\mu}_1\widetilde{\pi}_1) \otimes (\widetilde{\mu}_2\widetilde{\pi}_2)\bigr] \\
&+ \log\Bigl\{(\widetilde{\mu}_1\widetilde{\pi}_1) \otimes (\widetilde{\mu}_2\widetilde{\pi}_2)\bigl[\exp\{-\alpha_2\Psi_{\alpha_2/N}(R', M') + \alpha_1 R'\}\bigr]\Bigr\} - \log(\eta).
\end{aligned}$$
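Applying this to $\alpha_1 = 0$, we get that

$$(\nu\rho - \nu_3\rho_1)(r) \le \frac{1}{\alpha_2}\Bigl[\mathcal{K}\bigl[(\nu\rho) \otimes (\nu_3\rho_1), (\widetilde{\mu}\widetilde{\pi}) \otimes (\widetilde{\mu}_3\widetilde{\pi}_1)\bigr]
 + \log\Bigl\{(\widetilde{\mu}\widetilde{\pi}) \otimes (\widetilde{\mu}_3\widetilde{\pi}_1)\bigl[\exp\{\alpha_2\Psi_{-\alpha_2/N}(R', M')\}\bigr]\Bigr\} - \log(\eta)\Bigr].$$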
In the same way, to bound quantities of the form 

log< v 3 



Pi S exp 



sup< pi sup 

V5 I P5 



Ci(i/p + CiPi)(m') 
x p 4 |exp C 2 [C,z^ 3 pi + C5/O4] (™') 
|Ci [(i/p) <g> (v 5P5 ) + CiMpi ® fls)] (m') - DC(p 5 , Pi)} 



+ p 2 supl C 2 [Ca(^3Pi) ® (^5/Oe) 

+ (5^5(^4 <8> Pe)] (to') - 3C(p 6 , p 4 )| - X(u 5 ,u 3 ) 



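where $C_1$, $C_2$, $p_1$ and $p_2$ are positive constants, and similar terms, we need to use
inequalities of the type: for any prior distributions $\widetilde{\mu}_i\widetilde{\pi}_i$, $i = 1, 2$, with $\mathbb{P}$ probability
at least $1 - \eta$, for any posterior distributions $\nu_i\rho_i$, $i = 1, 2$,

$$\alpha_3(\nu_1\rho_1) \otimes (\nu_2\rho_2)(m') \le \log\Bigl\{(\widetilde{\mu}_1\widetilde{\pi}_1) \otimes (\widetilde{\mu}_2\widetilde{\pi}_2)\bigl[\exp\{\alpha_3\Phi_{-\alpha_3/N}(M')\}\bigr]\Bigr\}
 + \mathcal{K}\bigl[(\nu_1\rho_1) \otimes (\nu_2\rho_2), (\widetilde{\mu}_1\widetilde{\pi}_1) \otimes (\widetilde{\mu}_2\widetilde{\pi}_2)\bigr] - \log(\eta).$$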



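We need also the variant: with $\mathbb{P}$ probability at least $1 - \eta$, for any posterior
distribution $\nu_1 : \Omega \to \mathcal{M}_+^1(M)$ and any conditional posterior distributions
$\rho_1, \rho_2 : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,

$$\alpha_3\nu_1(\rho_1 \otimes \rho_2)(m') \le \log\Bigl\{\widetilde{\mu}_1(\widetilde{\pi}_1 \otimes \widetilde{\pi}_2)\bigl[\exp\{\alpha_3\Phi_{-\alpha_3/N}(M')\}\bigr]\Bigr\}
 + \mathcal{K}(\nu_1, \widetilde{\mu}_1) + \nu_1\bigl\{\mathcal{K}[\rho_1 \otimes \rho_2, \widetilde{\pi}_1 \otimes \widetilde{\pi}_2]\bigr\} - \log(\eta).$$

We deduce that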



log< v 3 



P\ \ exp 



Ci(vp + (iPi){m') 



x pi< exp 



C 2 [C3^3/?l + C5/94](m') 









sup< 


pi sup 


-{ 


^5 


P5 


«3 I 



7r) ® (fj.5 7T5) exp|^Q!3$_^2.(M')j | 
+ DC[(l/p) ® (f5P5), (P^T ® (A*5 7T5)] + log(|) 



+ Ci 



logjps^i ® 7r 5 ) exp a 3 $_^3.(M') j 



+ 3C(f 5 ,^ 5 ) + ^ 5 {^[P1 ® P5,7Tl I8)7r5]} + l°g(|) 



- 3C(P5,Pi) 



+ P2 SUp 

P6 



^-|log|(^ 3 7ri) <g> (^ 5 7r 6 )exp a 3 $_^.(M')j | 



+ X[(v 3 px) ® Kpe), (A»3 7Tl ® (/X 5 7T 6 )] 4- log(|) 



+ Ci 



log|^ 5 (7r 4 ®7r 6 ) exp a 3 $_^3 (M')] } 
+ X(u 5 ,Ji 5 ) + v 5 {X[p 4 ® P6,tt 4 (gi tT 6 ] } + log(|) 



- ^(P6,P4) 



- X(vs,v 3 ) 



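We are then left with the need to bound entropy terms like $\mathcal{K}(\nu_3\rho_1, \widetilde{\mu}_3\widetilde{\pi}_1)$, where
we have the choice of $\widetilde{\mu}_3$ and $\widetilde{\pi}_1$, to obtain a useful bound. As could be expected,
we decompose it into

$$\mathcal{K}(\nu_3\rho_1, \widetilde{\mu}_3\widetilde{\pi}_1) = \mathcal{K}(\nu_3, \widetilde{\mu}_3) + \nu_3\bigl[\mathcal{K}(\rho_1, \widetilde{\pi}_1)\bigr].$$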
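
This decomposition is the chain rule for the relative entropy of a joint distribution on (model, parameter) pairs. The following Python sketch checks it numerically on finite sets; the sizes and the randomly drawn distributions are hypothetical, and `kl` is a helper introduced here for illustration.

```python
import numpy as np


def kl(p, q):
    """Relative entropy K(p, q) between two finite distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


rng = np.random.default_rng(0)


def random_dist(shape):
    """Random strictly positive distribution along the last axis."""
    w = rng.random(shape) + 0.1
    return w / w.sum(axis=-1, keepdims=True)


n_models, n_params = 3, 4
nu = random_dist(n_models)               # posterior on the models
mu = random_dist(n_models)               # prior on the models
rho = random_dist((n_models, n_params))  # conditional posterior, one row per model
pi = random_dist((n_models, n_params))   # conditional prior, one row per model

# Joint distributions nu * rho and mu * pi on (model, parameter) pairs.
joint_post = nu[:, None] * rho
joint_prior = mu[:, None] * pi

lhs = kl(joint_post.ravel(), joint_prior.ravel())
rhs = kl(nu, mu) + sum(nu[m] * kl(rho[m], pi[m]) for m in range(n_models))
print(lhs, rhs)
```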

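Let us look after the second term first, choosing $\widetilde{\pi}_1 = \pi_{\exp(-\beta_1 R)}$:
$$\nu_3\bigl[\mathcal{K}(\rho_1, \widetilde{\pi}_1)\bigr] = \nu_3\bigl[\beta_1(\rho_1 - \widetilde{\pi}_1)(R) + \mathcal{K}(\rho_1, \pi) - \mathcal{K}(\widetilde{\pi}_1, \pi)\bigr]$$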



< 



ft 



Q'l 



Ck2^3(Pl - 7Tl)(r) + X{u 3i p 3 ) + V 3 [X(pi,-Kl)] 

+ log{/Z 3 (5r? 2 ) [exp{-a 2 1<^(#, M') + ai Ef}] } - logfa) 

+ K S [aC(pi,7r)-3C(5ri,7r) 

9C(i/ 3 ,/U 3 ) + u 3 [X(pi^i)] 

+ log{/7 3 (^f 2 ) [exp{-a 2 *^(fl', M') +aii?'}]} - logfa) 

+ ^{3C[pi,7T exp( _^: 



r) 
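Thus, when the constraint $\lambda_1 = \frac{\beta_1\alpha_2}{\alpha_1}$ is satisfied,

$$\nu_3\bigl[\mathcal{K}(\rho_1, \widetilde{\pi}_1)\bigr] \le \Bigl(1 - \frac{\beta_1}{\alpha_1}\Bigr)^{-1}\frac{\beta_1}{\alpha_1}\Bigl\{\mathcal{K}(\nu_3, \widetilde{\mu}_3)
 + \log\Bigl\{\widetilde{\mu}_3\bigl(\widetilde{\pi}_1^{\otimes 2}\bigr)\bigl[\exp\{-\alpha_2\Psi_{\alpha_2/N}(R', M') + \alpha_1 R'\}\bigr]\Bigr\} - \log(\eta)\Bigr\}.$$

We can further specialize the constants, choosing $\alpha_1 = N\sinh(\frac{\alpha_2}{N})$, so that

$$-\alpha_2\Psi_{\alpha_2/N}(R', M') + \alpha_1 R' \le 2N\sinh\Bigl(\frac{\alpha_2}{2N}\Bigr)^2 M'.$$

We can for instance choose $\alpha_2 = \gamma$, $\alpha_1 = N\sinh(\frac{\gamma}{N})$ and $\beta_1 = \lambda_1\frac{N}{\gamma}\sinh(\frac{\gamma}{N})$, leading
to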

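Proposition 2.3.3. With the notation of Theorem 2.3.2, the constants being set
as explained above, putting $\widetilde{\pi}_1 = \pi_{\exp(-\lambda_1\frac{N}{\gamma}\sinh(\frac{\gamma}{N})R)}$, with $\mathbb{P}$ probability at least
$1 - \eta$,

$$\nu_3\bigl[\mathcal{K}(\rho_1, \widetilde{\pi}_1)\bigr] \le \Bigl(1 - \frac{\lambda_1}{\gamma}\Bigr)^{-1}\frac{\lambda_1}{\gamma}\Bigl\{\mathcal{K}(\nu_3, \widetilde{\mu}_3)
 + \log\Bigl\{\widetilde{\mu}_3\bigl(\widetilde{\pi}_1^{\otimes 2}\bigr)\bigl[\exp\bigl\{2N\sinh(\tfrac{\gamma}{2N})^2 M'\bigr\}\bigr]\Bigr\} - \log(\eta)\Bigr\}.$$

More generally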



^3 



W^.)]<(i-^)-^ 



K(^ 3 ,/z 3 ) 



+ log{/Z 3 (Srf 2 ) [exp{2iVsinh(^) 2 M'}] } - log(r?) 



V 3 [3C(p lPl )]. 



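In a similar way, let us now choose $\widetilde{\mu}_3 = \mu_{\exp[-\alpha_3\overline{\pi}(R)]}$. We can write

$$\begin{aligned}
\mathcal{K}(\nu, \widetilde{\mu}_3) &= \alpha_3(\nu - \widetilde{\mu}_3)\overline{\pi}(R) + \mathcal{K}(\nu, \mu) - \mathcal{K}(\widetilde{\mu}_3, \mu) \\
&\le \frac{\alpha_3}{\alpha_1}\Bigl[\alpha_2(\nu - \widetilde{\mu}_3)\overline{\pi}(r) + \mathcal{K}(\nu, \widetilde{\mu}_3)
 + \log\Bigl\{(\widetilde{\mu}_3\overline{\pi}) \otimes (\widetilde{\mu}_3\overline{\pi})\bigl[\exp\{-\alpha_2\Psi_{\alpha_2/N}(R', M') + \alpha_1 R'\}\bigr]\Bigr\} - \log(\eta)\Bigr] \\
&\qquad + \mathcal{K}(\nu, \mu) - \mathcal{K}(\widetilde{\mu}_3, \mu).
\end{aligned}$$

Let us choose $\alpha_2 = \gamma$, $\alpha_1 = N\sinh(\frac{\gamma}{N})$, and let us add some other entropy
inequalities to get rid of $\overline{\pi}$ in a suitable way, the approach of entropy compensation being
the same as that used to obtain the empirical bound of Theorem 2.3.2 (page 94).
This results with $\mathbb{P}$ probability at least $1 - \eta$ in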



(l-^W 3 )<^ 



7(1/ - M3)7r(r) 
+ log{(/i 3 7f) ® (us*) [exp{- 7 * 7 (R ; , M') + aitf}] } + log(|)

+ %{v,jj) -3C(ju 3 ,/i), 



Ce(l-— )?3[3C(P6,7F)] <Ce — 



Q:i 



7M3(pe ~ 7r)(r) 
+ log{j5 3 (7F 02 ) [exp{- 7 *, (i?', M') + ai R'}] } + log(§) 



C 7 (l-£-)M3[3C(P7,7f)] <C7 



7M3(P7 - 7r)(r) 

+ log{/Z 3 (5f® 2 ) [expj-7* ^ (i?', M') + } + log(§) 

+ C7M3[3C(P7,7r)-DC(7f,7r)], 



^(l-|-)l/[3C(p 8) 7f)] <C8 



ai 



7^(p 8 -7r)(r) +3C(i/, p 3 ) 

+ log{p 3 (7f® 2 ) [exp{- 7 vl/^(i?', M') + ai R'}] } + log(|) 

+ Csi/[3C(p8,7r)-DC(7f,7r)] 1 



Ql 



7i/(p 9 -rr)(r) + 3C(v, p 3 ) 

+ log{p 3 (7f 02 ) [exp{- 7 * , (i?', M') + aiJ R'}] } + Iog(§) 

+ C9f[3C(p9,7r)-aC(7f,7r)], 

where we have introduced a bunch of constants, assumed to be positive, that we 
will more precisely set to 

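$$\begin{aligned}
x_8 + x_9 &= 1, \\
(\zeta_6\beta + x_8\alpha_3)\frac{\gamma}{\alpha_1} &= \lambda_6, \\
(\zeta_7\beta + x_9\alpha_3)\frac{\gamma}{\alpha_1} &= \lambda_7, \\
(\zeta_8\beta - x_8\alpha_3)\frac{\gamma}{\alpha_1} &= \lambda_8, \\
(\zeta_9\beta - x_9\alpha_3)\frac{\gamma}{\alpha_1} &= \lambda_9.
\end{aligned}$$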



We get with P probability at least 1 — 77, 

fl - — - (Cs + Cq)— )0C(i/,/Zs) < 

7[^(x 8 p 8 + £9/99)0) - ^{x&pe + x 9 p 7 )(r)] 
log{(/Z 3 7r) <8> (m 3 tT) [exp{- 7 *^(i?', M') + a^i?'}] } 
+ (Ce + (7 + Cs + Cs>)£ log{ji 3 (7f® 2 ) [exp{- 7 * , (i?', M') + aii?'}] } 



ati " a.\. 
ai 
+ n) - 3C(/Z 3 , m) + (— + (Ce + Cr + Cs + Cq)— ) Mf



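Let us choose the constants so that $\lambda_1 = \lambda_7 = \lambda_9$, $\lambda_4 = \lambda_6 = \lambda_8$, $\alpha_3 x_9\frac{\gamma}{\alpha_1} = \xi_1$ and
$\alpha_3 x_8\frac{\gamma}{\alpha_1} = \xi_4$. This is done by setting

$$\begin{aligned}
x_8 &= \frac{\xi_4}{\xi_1 + \xi_4}, &
x_9 &= \frac{\xi_1}{\xi_1 + \xi_4}, \\
\alpha_3 &= \frac{N}{\gamma}\sinh\bigl(\tfrac{\gamma}{N}\bigr)(\xi_1 + \xi_4), \\
\zeta_6 &= \frac{N}{\beta\gamma}\sinh\bigl(\tfrac{\gamma}{N}\bigr)(\lambda_4 - \xi_4), &
\zeta_7 &= \frac{N}{\beta\gamma}\sinh\bigl(\tfrac{\gamma}{N}\bigr)(\lambda_1 - \xi_1), \\
\zeta_8 &= \frac{N}{\beta\gamma}\sinh\bigl(\tfrac{\gamma}{N}\bigr)(\lambda_4 + \xi_4), &
\zeta_9 &= \frac{N}{\beta\gamma}\sinh\bigl(\tfrac{\gamma}{N}\bigr)(\lambda_1 + \xi_1).
\end{aligned}$$

The inequality $\lambda_1 > \xi_1$ is always satisfied. The inequality $\lambda_4 > \xi_4$ is required for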
the above choice of constants, and will be satisfied for a suitable choice of ( 3 and 

a- 

Under these assumptions, we obtain with P probability at least 1 — r\ 
(l- — -(Cs + C 9 )— )x(v,]i 3 ) < (^-M 3 )(£iPi+£ 4 p4)(r) 



+ ^log{(^ 3 7f) ® (/I 3 7f) exp{- 7 *i(i?',M') + a 1 i?'} } 
+ (Ce +(7 + Cs + C 9 )£ log{M 3 (7f® 2 ) [exp{- 7 *i(ii', M 1 ) + ai R'}] } 

+ X(u, /*) - 3C(£ 3 , M ) + (— + (Ce + Cr + Cs + Co) — ) log(|). 

This proves 



Proposition 2.3.4. The constants being set as explained above, with $\mathbb{P}$ probability
at least $1 - \eta$, for any posterior distribution $\nu : \Omega \to \mathcal{M}_+^1(M)$,

3C(^ 3 ) < (1 - — - (Cs + GO—) -1 [oc(i/,^) 

+ ^log|(/2 3 7f) (8> (/J 3 7F)[exp{ -7*^, (i?',M') + a 1 R'} } 

+ (Ce + Ct + Cs + Cs>)£ log{M 3 (7f® 2 ) [exp{- 7 *7 M') + } 

+ (- + (Ce + C7 + C8 + C9)-)log(|) 
Vai ai / ' 



Thus 
DC(l/3yOl,//3 7Tl) <



1> 7 



»3 
(11 



(C 8 + C 9 )£ 

^log{(iu 3 7F(g) (/I 3 7f)[exp{- 7 *, (fl'.Af) +a 1 fl'}l) 
aiL L JJ 

+ (Ce + C7 + Cs + Cq)^- log{/l3(7f® 2 ) [exp{- 7 * , (i?', M') + ai R'}] } 

+ (^ + (C6 + C7 + C8 + C 9 )^)l0g(f) 

logj^^f 2 ) [exp{2iVsinh(^) 2 Af'}] } - log(|; 



We will not go further, lest it may become tedious, but we hope we have given
sufficient hints to state informally that the bound $B(\nu, \rho, \beta)$ of Theorem 2.3.2 (page
94) is upper bounded with $\mathbb{P}$ probability close to one by a bound of the same flavour
where the empirical quantities $r$ and $m'$ have been replaced with their expectations
$R$ and $M'$.



2.3.3. Two step localization between posterior distributions 

Here we work with a family of prior distributions described by a regular conditional
prior distribution $\pi : M \to \mathcal{M}_+^1(\Theta)$, where $M$ is some measurable index set. This
family may typically describe a countable family of parametric models. In this case
$M = \mathbb{N}$, and each of the prior distributions $\pi(i, \cdot)$, $i \in \mathbb{N}$, satisfies some parametric
complexity assumption of the type

$$\limsup_{\beta \to +\infty} \beta\bigl[\pi_{\exp(-\beta R)}(i, \cdot)(R) - \operatorname{ess\,inf}_{\pi(i, \cdot)} R\bigr] = d_i < +\infty, \qquad i \in M.$$

Let us consider also a prior distribution $\mu \in \mathcal{M}_+^1(M)$ defined on the index set $M$.

Our aim here will be to compare the performance of two given posterior distributions
$\nu_1\rho_1$ and $\nu_2\rho_2$, where $\nu_1, \nu_2 : \Omega \to \mathcal{M}_+^1(M)$, and where $\rho_1, \rho_2 : \Omega \times M \to
\mathcal{M}_+^1(\Theta)$. More precisely, we would like to establish a bound for $(\nu_1\rho_1 - \nu_2\rho_2)(R)$
which could be a starting point to implement a selection method similar to the one
described in Theorem 2.2.4 (page 75). To this purpose, we can start with Theorem
2.2.1 (page 69), which says that with $\mathbb{P}$ probability at least $1 - \epsilon$,

$$\begin{aligned}
-N\log\bigl\{1 - \tanh(\tfrac{\lambda}{N})(\nu_1\rho_1 - \nu_2\rho_2)(R)\bigr\} \le{}& \lambda(\nu_1\rho_1 - \nu_2\rho_2)(r) \\
&+ N\log[\cosh(\tfrac{\lambda}{N})](\nu_1\rho_1) \otimes (\nu_2\rho_2)(m') + \mathcal{K}(\nu_1, \overline{\mu}) + \mathcal{K}(\nu_2, \overline{\mu}) \\
&+ \nu_1\bigl[\mathcal{K}(\rho_1, \widetilde{\pi})\bigr] + \nu_2\bigl[\mathcal{K}(\rho_2, \widetilde{\pi})\bigr] - \log(\epsilon),
\end{aligned}$$

where $\overline{\mu} \in \mathcal{M}_+^1(M)$ and $\widetilde{\pi} : M \to \mathcal{M}_+^1(\Theta)$ are suitably localized prior distributions
to be chosen later on. To use these localized prior distributions, we need empirical
bounds for the entropy terms $\mathcal{K}(\nu_i, \overline{\mu})$ and $\nu_i[\mathcal{K}(\rho_i, \widetilde{\pi})]$, $i = 1, 2$.

Bounding $\nu[\mathcal{K}(\rho, \widetilde{\pi})]$ can be done using the following generalization of Corollary
2.1.19 (page 68):

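Corollary 2.3.5. For any positive real constants $\gamma$ and $\lambda$ such that $\gamma < \lambda$, for
any prior distribution $\overline{\mu} \in \mathcal{M}_+^1(M)$ and any conditional prior distribution $\pi : M \to
\mathcal{M}_+^1(\Theta)$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution $\nu : \Omega \to
\mathcal{M}_+^1(M)$, and any conditional posterior distribution $\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,

$$\nu\Bigl\{\mathcal{K}\bigl[\rho, \pi_{\exp[-N\frac{\gamma}{\lambda}\tanh(\frac{\lambda}{N})R]}\bigr]\Bigr\} \le K'(\nu, \rho, \gamma, \lambda, \epsilon) + \frac{1}{\frac{\lambda}{\gamma} - 1}\mathcal{K}(\nu, \overline{\mu}),$$

where

$$K'(\nu, \rho, \gamma, \lambda, \epsilon) \overset{\mathrm{def}}{=} \Bigl(1 - \frac{\gamma}{\lambda}\Bigr)^{-1}\Bigl\{\nu\bigl[\mathcal{K}(\rho, \pi_{\exp(-\gamma r)})\bigr] - \frac{\gamma}{\lambda}\log(\epsilon)
 + \nu\Bigl[\log\Bigl\{\pi_{\exp(-\gamma r)}\bigl[\exp\bigl\{N\tfrac{\gamma}{\lambda}\log[\cosh(\tfrac{\lambda}{N})]\rho(m')\bigr\}\bigr]\Bigr\}\Bigr]\Bigr\}.$$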



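To apply this corollary to our case, we have to set

$$\widetilde{\pi} = \pi_{\exp[-N\frac{\gamma}{\lambda}\tanh(\frac{\lambda}{N})R]}.$$

Let us also consider for some positive real constant $\beta$ the conditional prior
distribution

$$\overline{\pi} = \pi_{\exp(-\beta R)}$$

and the prior distribution

$$\overline{\mu} = \mu_{\exp[-\alpha\overline{\pi}(R)]}.$$

Let us see how we can bound, given any posterior distribution $\nu : \Omega \to \mathcal{M}_+^1(M)$,
the divergence $\mathcal{K}(\nu, \overline{\mu})$. We can see that

$$\mathcal{K}(\nu, \overline{\mu}) = \alpha(\nu - \overline{\mu})\overline{\pi}(R) + \mathcal{K}(\nu, \mu) - \mathcal{K}(\overline{\mu}, \mu).$$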
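
The identity just stated holds for any Gibbs measure: if a prior is tilted by $\exp(-\alpha h)$, the divergence to the tilted measure splits into a difference of expectations of $h$ and a difference of divergences to the original prior. The following Python sketch checks this on a finite index set; the vector `h` stands for the values of the averaged risk per model, and all numerical values are hypothetical.

```python
import numpy as np


def kl(p, q):
    """Relative entropy K(p, q) between two finite distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def gibbs(mu, h, alpha):
    """Gibbs measure with density proportional to exp(-alpha * h) w.r.t. mu."""
    w = mu * np.exp(-alpha * (h - h.min()))  # shift by h.min() for stability
    return w / w.sum()


rng = np.random.default_rng(1)
n_models = 5
mu = rng.random(n_models) + 0.1
mu /= mu.sum()                  # prior on the models
nu = rng.random(n_models) + 0.1
nu /= nu.sum()                  # arbitrary posterior on the models
h = rng.random(n_models)        # stands for the risk attached to each model
alpha = 3.0

mu_bar = gibbs(mu, h, alpha)
lhs = kl(nu, mu_bar)
rhs = alpha * float((nu - mu_bar) @ h) + kl(nu, mu) - kl(mu_bar, mu)
print(lhs, rhs)
```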

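Now, let us introduce the conditional posterior distribution

$$\widehat{\pi} = \pi_{\exp(-\gamma r)}$$

and let us decompose

$$(\nu - \overline{\mu})\bigl[\overline{\pi}(R)\bigr] = \nu\bigl[\overline{\pi}(R) - \widehat{\pi}(R)\bigr] + (\nu - \overline{\mu})\bigl[\widehat{\pi}(R)\bigr] + \overline{\mu}\bigl[\widehat{\pi}(R) - \overline{\pi}(R)\bigr].$$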
Starting from the exponential inequality 



[tt <g> tt] exp{-./Vlog[l - tanh(^)i?'] - jr' - AHogfcosh^)]™'} 



<L 



and reasoning in the same way that led to Theorem 12.1.11 (page I52p in the simple 
case when we take in this theorem A = 7, we get with P probability at least 1 — e, 
that 



7Vlog{l - tanh($)i/(7f + (3v{W — 7?) (R) 



< v 



log 



gj?? exp{iVlog[cosh(^)7f(m')} } 



+ 3C(i/,7Z)-log(e). 



N log{l - tanh(^ )/Z (tt - tt) (R) } - Pp(tc - tt ) (i?) 



< 



log|7r exp{A r log[cosh(-^)7f(TO / )} | 



log(e). 
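In the meantime, using Theorem 2.2.1 (page 69) and Corollary 2.3.5 above, we
see that with $\mathbb{P}$ probability at least $1 - 2\epsilon$, for any conditional posterior distribution
$\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,

$$\begin{aligned}
-N\log\bigl\{1 - \tanh(\tfrac{\lambda}{N})(\nu - \overline{\mu})\rho(R)\bigr\} \le{}& \lambda(\nu - \overline{\mu})\rho(r) \\
&+ N\log[\cosh(\tfrac{\lambda}{N})](\nu\rho) \otimes (\overline{\mu}\rho)(m') + (\nu + \overline{\mu})\mathcal{K}(\rho, \widetilde{\pi}) + \mathcal{K}(\nu, \overline{\mu}) - \log(\epsilon) \\
\le{}& \lambda(\nu - \overline{\mu})\rho(r) + N\log[\cosh(\tfrac{\lambda}{N})](\nu\rho) \otimes (\overline{\mu}\rho)(m') + \mathcal{K}(\nu, \overline{\mu}) - \log(\epsilon) \\
&+ \Bigl(1 - \frac{\gamma}{\lambda}\Bigr)^{-1}(\nu + \overline{\mu})\Bigl\{\mathcal{K}(\rho, \widehat{\pi}) + \log\Bigl\{\widehat{\pi}\bigl[\exp\bigl\{N\tfrac{\gamma}{\lambda}\log[\cosh(\tfrac{\lambda}{N})]\rho(m')\bigr\}\bigr]\Bigr\}\Bigr\} \\
&+ \Bigl(\frac{\lambda}{\gamma} - 1\Bigr)^{-1}\bigl[\mathcal{K}(\nu, \overline{\mu}) - 2\log(\epsilon)\bigr].
\end{aligned}$$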



Putting all this together, we see that with $\mathbb{P}$ probability at least $1 - 3\epsilon$, for any
posterior distribution $\nu \in \mathcal{M}_+^1(M)$,



1 - 



iVtanh(^)+/3 7Vtanh(A)(i_2) 



a 



a 



JVtanh(^) -13 



X{v,p) < 

log {7? exp {iVlog[cosh(i)]7f(m')}]} 
log|7r exp{iVlog[cosh(-^-)]7f(TO')} j 



log(e 
- log(e) 



+ a[7Vtanh(A)]" 1 | 

\{v - Ji)tt (r) + iVlog[cosh(-^)] (vtt) ® (jm){rri) 
+ (l - 1) ~\v + -P) log{^[exp{7Vj log[cosh(A)]^( TO ')}] } 

- ii| log(e) j + X{u, p) - X(p, p). 

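Replacing in the right-hand side of this inequality the unobserved prior distribution
$\overline{\mu}$ with the worst possible posterior distribution, we obtain

Theorem 2.3.6. For any positive real constants $\alpha$, $\beta$, $\gamma$ and $\lambda$, using the notation

$$\begin{aligned}
\overline{\pi} &= \pi_{\exp(-\beta R)}, \\
\overline{\mu} &= \mu_{\exp[-\alpha\overline{\pi}(R)]}, \\
\widehat{\pi} &= \pi_{\exp(-\gamma r)}, \\
\widehat{\mu} &= \mu_{\exp\bigl[-\frac{\alpha\lambda}{N\tanh(\lambda/N)}\widehat{\pi}(r)\bigr]},
\end{aligned}$$

with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution $\nu : \Omega \to \mathcal{M}_+^1(M)$,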



1 - 



X(v,p) < X(u,p) 



7Vtanh(^) + /3 iVtanh(A)(l_2) 

log|7f [exp{7Vlog[cosh(^-)]7f(m')} | 



TVtanh(^) +0 
7Vtanh(A)(l-2

+ iog< 



X 



log{7f[exp{^log[cosh(^)]7r(m')}]} 
7r|exp 7Vlog[cosh(-^)]7r(m') j 
7r{exp Ar2i g[ C0S h(A)]^( TO ') | 



JVta„h(-J r )-,3 



JVtanh(-A.)(l-4) 



x exp 
1 



alog[cosh(A)] 
tanh(A) 



(vtt) (g> 7r(m') 



1 + i 



7Vtanh(^)+/3 JVtanh(#) -/3 JVtanh(^)(l - }) 



log( 



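This result is satisfactory, but at the same time hints at some possible improvement
in the choice of the localized prior $\overline{\mu}$, which is here somewhat lacking a variance
term. We will consider in the remainder of this section the use of

(2.38) $$\overline{\mu} = \mu_{\exp[-\alpha\overline{\pi}(R) - \xi\,\widetilde{\pi} \otimes \widetilde{\pi}(M')]},$$

where $\xi$ is some positive real constant and $\widetilde{\pi} = \pi_{\exp(-\widetilde{\beta}R)}$ is some appropriate
conditional prior distribution with positive real parameter $\widetilde{\beta}$. With this new choice

$$\mathcal{K}(\nu, \overline{\mu}) = \alpha(\nu - \overline{\mu})\overline{\pi}(R) + \xi(\nu - \overline{\mu})(\widetilde{\pi} \otimes \widetilde{\pi})(M') + \mathcal{K}(\nu, \mu) - \mathcal{K}(\overline{\mu}, \mu).$$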

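We already know how to deal with the first factor $\alpha(\nu - \overline{\mu})\overline{\pi}(R)$, since the
computations we made to give it an empirical upper bound were valid for any choice
of the localized prior distribution $\overline{\mu}$. Let us now deal with $\xi(\nu - \overline{\mu})(\widetilde{\pi} \otimes \widetilde{\pi})(M')$.
Since $m'(\theta, \theta')$ is a sum of independent Bernoulli random variables, we can easily
generalize the result of Theorem 1.1.4 (page 4) to prove that with $\mathbb{P}$ probability at
least $1 - \epsilon$

$$N\bigl[1 - \exp(-\tfrac{\zeta}{N})\bigr]\nu(\widetilde{\pi} \otimes \widetilde{\pi})(M')
 \le \zeta\Phi_{\zeta/N}\bigl[\nu(\widetilde{\pi} \otimes \widetilde{\pi})(M')\bigr] \le \zeta\nu(\widetilde{\pi} \otimes \widetilde{\pi})(m') + \mathcal{K}(\nu, \overline{\mu}) - \log(\epsilon).$$

In the same way, with $\mathbb{P}$ probability at least $1 - \epsilon$,

$$-N\bigl[\exp(\tfrac{\zeta}{N}) - 1\bigr]\overline{\mu}(\widetilde{\pi} \otimes \widetilde{\pi})(M')
 \le -\zeta\Phi_{-\zeta/N}\bigl[\overline{\mu}(\widetilde{\pi} \otimes \widetilde{\pi})(M')\bigr] \le -\zeta\overline{\mu}(\widetilde{\pi} \otimes \widetilde{\pi})(m') - \log(\epsilon).$$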

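We would like now to replace $(\widetilde{\pi} \otimes \widetilde{\pi})(m')$ with an empirical quantity. In order to do
this, we will use an entropy bound. Indeed for any conditional posterior distribution
$\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,

$$\begin{aligned}
\nu\bigl[\mathcal{K}(\rho, \widetilde{\pi})\bigr] &= \widetilde{\beta}\nu(\rho - \widetilde{\pi})(R) + \nu\bigl[\mathcal{K}(\rho, \pi) - \mathcal{K}(\widetilde{\pi}, \pi)\bigr] \\
&\le \frac{\widetilde{\beta}}{N\tanh(\frac{\gamma}{N})}\Bigl[\gamma\nu(\rho - \widetilde{\pi})(r) + N\log[\cosh(\tfrac{\gamma}{N})]\nu(\rho \otimes \widetilde{\pi})(m') \\
&\qquad + \mathcal{K}(\nu, \overline{\mu}) + \nu\bigl[\mathcal{K}(\rho, \widetilde{\pi})\bigr] - \log(\epsilon)\Bigr] + \nu\bigl[\mathcal{K}(\rho, \pi) - \mathcal{K}(\widetilde{\pi}, \pi)\bigr].
\end{aligned}$$

Thus choosing $\widetilde{\beta} = N\tanh(\frac{\gamma}{N})$,

$$\gamma\nu(\widetilde{\pi} - \rho)(r) + \nu\bigl[\mathcal{K}(\widetilde{\pi}, \pi) - \mathcal{K}(\rho, \pi)\bigr]
 \le N\log[\cosh(\tfrac{\gamma}{N})]\nu(\rho \otimes \widetilde{\pi})(m') + \mathcal{K}(\nu, \overline{\mu}) - \log(\epsilon).$$

Choosing $\rho = \widehat{\pi}$, where $\widehat{\pi} = \pi_{\exp(-\gamma r)}$, we get

$$\nu\bigl[\mathcal{K}(\widetilde{\pi}, \widehat{\pi})\bigr] \le N\log[\cosh(\tfrac{\gamma}{N})]\nu(\widehat{\pi} \otimes \widetilde{\pi})(m') + \mathcal{K}(\nu, \overline{\mu}) - \log(\epsilon).$$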
This implies that 

£z/(tt ®5r)(m') = ^[^(m')] -0C(7r,7r)| + i/[3C(tt, t?)] 

< ^ jlog 7r{exp[^7f(m')] } | 

+ JVlog[cosh(^)]i/(5f ® 5r)(m') + 5C(i/,7*) - log(e). 

Thus 

{£ - iVlog[cosh(^)] }i/(5r <8> 7r)(m') 

< i/{log[7r{exp[^r(m')] }] } + 3C(i/,p) - log(e) 

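and

$$\nu\bigl[\mathcal{K}(\widetilde{\pi}, \widehat{\pi})\bigr] \le \Bigl(\frac{\zeta}{N\log[\cosh(\frac{\gamma}{N})]} - 1\Bigr)^{-1}\nu\bigl\{\log[\widehat{\pi}\{\exp[\zeta\widehat{\pi}(m')]\}]\bigr\}
 + \frac{\zeta}{\zeta - N\log[\cosh(\frac{\gamma}{N})]}\bigl[\mathcal{K}(\nu, \overline{\mu}) - \log(\epsilon)\bigr].$$

Taking for simplicity $\zeta = 2N\log[\cosh(\frac{\gamma}{N})]$ and noticing that

$$2N\log[\cosh(\tfrac{\gamma}{N})] = -N\log\Bigl(1 - \frac{\widetilde{\beta}^2}{N^2}\Bigr),$$

we get

Theorem 2.3.7. Let us put $\widetilde{\pi} = \pi_{\exp(-\widetilde{\beta}R)}$ and $\widehat{\pi} = \pi_{\exp(-\gamma r)}$, where $\gamma$ is some
arbitrary positive real constant and $\widetilde{\beta} = N\tanh(\frac{\gamma}{N})$, so that
$\gamma = \frac{N}{2}\log\Bigl(\frac{1 + \widetilde{\beta}/N}{1 - \widetilde{\beta}/N}\Bigr)$.
With $\mathbb{P}$ probability at least $1 - \epsilon$,

$$\nu\bigl[\mathcal{K}(\widetilde{\pi}, \widehat{\pi})\bigr] \le \nu\Bigl\{\log\bigl\{\widehat{\pi}\bigl[\exp\{2N\log[\cosh(\tfrac{\gamma}{N})]\widehat{\pi}(m')\}\bigr]\bigr\}\Bigr\} + 2\bigl[\mathcal{K}(\nu, \overline{\mu}) - \log(\epsilon)\bigr].$$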
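As a consequence

$$\begin{aligned}
\zeta\nu(\widetilde{\pi} \otimes \widetilde{\pi})(m') &= \zeta\nu(\widetilde{\pi} \otimes \widetilde{\pi})(m') - \nu\bigl[\mathcal{K}(\widetilde{\pi} \otimes \widetilde{\pi}, \widehat{\pi} \otimes \widehat{\pi})\bigr] + 2\nu\bigl[\mathcal{K}(\widetilde{\pi}, \widehat{\pi})\bigr] \\
&\le \nu\bigl\{\log\bigl[\widehat{\pi} \otimes \widehat{\pi}[\exp(\zeta m')]\bigr]\bigr\}
 + 2\nu\Bigl\{\log\bigl\{\widehat{\pi}\bigl[\exp\{2N\log[\cosh(\tfrac{\gamma}{N})]\widehat{\pi}(m')\}\bigr]\bigr\}\Bigr\} + 4\bigl[\mathcal{K}(\nu, \overline{\mu}) - \log(\epsilon)\bigr].
\end{aligned}$$

Let us take for the sake of simplicity $\zeta = 2N\log[\cosh(\frac{\gamma}{N})]$, to get

$$\zeta\nu(\widetilde{\pi} \otimes \widetilde{\pi})(m') \le 3\nu\bigl\{\log\bigl[\widehat{\pi} \otimes \widehat{\pi}[\exp(\zeta m')]\bigr]\bigr\} + 4\bigl[\mathcal{K}(\nu, \overline{\mu}) - \log(\epsilon)\bigr].$$

This proves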
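Proposition 2.3.8. Let us consider some arbitrary prior distribution $\overline{\mu} \in \mathcal{M}_+^1(M)$
and some arbitrary conditional prior distribution $\pi : M \to \mathcal{M}_+^1(\Theta)$. Let $\widetilde{\beta} < N$ be
some positive real constant. Let us put $\widetilde{\pi} = \pi_{\exp(-\widetilde{\beta}R)}$ and $\widehat{\pi} = \pi_{\exp(-\gamma r)}$, with
$\widetilde{\beta} = N\tanh(\frac{\gamma}{N})$. Moreover let us put $\zeta = 2N\log[\cosh(\frac{\gamma}{N})]$. With $\mathbb{P}$ probability at
least $1 - 2\epsilon$, for any posterior distribution $\nu \in \mathcal{M}_+^1(M)$,

$$\begin{aligned}
\nu(\widetilde{\pi} \otimes \widetilde{\pi})(M') &\le \frac{3\nu\bigl\{\log\bigl[\widehat{\pi} \otimes \widehat{\pi}[\exp(\zeta m')]\bigr]\bigr\} + 5\bigl[\mathcal{K}(\nu, \overline{\mu}) - \log(\epsilon)\bigr]}{N\bigl[1 - \exp(-\frac{\zeta}{N})\bigr]} \\
&= \frac{1}{N\tanh(\frac{\gamma}{N})^2}\Bigl\{3\nu\Bigl[\log\bigl\{\widehat{\pi} \otimes \widehat{\pi}\bigl[\exp\{2N\log[\cosh(\tfrac{\gamma}{N})]m'\}\bigr]\bigr\}\Bigr]
 + 5\bigl[\mathcal{K}(\nu, \overline{\mu}) - \log(\epsilon)\bigr]\Bigr\}.
\end{aligned}$$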
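
The simplifications used in this derivation rest on three elementary hyperbolic identities linking the inverse temperature of the empirical Gibbs posterior (written `gamma` below), the parameter `N * tanh(gamma / N)` of the localized prior (written `beta`) and the constant `2 * N * log(cosh(gamma / N))` (written `zeta`). The following Python sketch checks them numerically; the values of `N` and `gamma` are arbitrary, and the variable names are chosen here for illustration.

```python
import math

N = 1000          # sample size (arbitrary value for the check)
gamma = 250.0     # inverse temperature of the empirical Gibbs posterior

beta = N * math.tanh(gamma / N)
zeta = 2 * N * math.log(math.cosh(gamma / N))

# 2 N log cosh(gamma / N) = -N log(1 - beta^2 / N^2)
identity_1 = -N * math.log(1 - (beta / N) ** 2)

# gamma = (N / 2) log((1 + beta / N) / (1 - beta / N)), i.e. N * atanh(beta / N)
gamma_back = 0.5 * N * math.log((1 + beta / N) / (1 - beta / N))

# N [1 - exp(-zeta / N)] = N tanh(gamma / N)^2
denominator = N * (1 - math.exp(-zeta / N))

print(zeta, identity_1, gamma_back, denominator)
```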

In the same way, 

— CJi{tt ® 7?) (to') < 7i|log 5f ® 7? [exp(— Cm.')] } 



271 



logj?? exp{27Vlog[cosh(^)]?f(m')} } 



-41og(e) 



and thus 

-7i(7r«)7r)(M') < 



271 



— |//{log tt <g> 7r[exp(-Cm')] } 
log|7f[exp{2iVlog[cosh(^)]7r(m')} } 



51og(e) 



Here we have purposely kept $\zeta$ as an arbitrary positive real constant, to be tuned
later (in order to be able to strengthen more or less the compensation of variance
terms).

We are now properly equipped to estimate the divergence with respect to $\overline{\mu}$, the
choice of prior distribution made in equation (2.38, page 105). Indeed we can now
write



1 - 



5£ 



7Vtanh(.X) + /? jVtanh(A)(l - 2) 7Vtanh(^) 2 



%{v,-p) 



< 



a 


{; 


l0g|7T 


iVtanh(^) 4 




a 






logj?? 


iVtanh(^) - 


l\ 



log(e) 
log(e) 



A^tanh(-^; 

A 



\{v - 7J)7?(r) +iVlog[cosh(^r)](^7f) ® <Jm){m') 
+ (l-^)~V + M) log{^[exp{7V2 1og[cosh(A)]^( TO ')}]| 






+ 



iVtanhm 2 



3v 



logjif (g)7f |^exp{2iVlog[cosh(^)]m'} | 



51og(e) 



— j/ijlog^Tf <g> 7f[exp(-Cm')] } 



2JI 



log|?T exp{27Vlog[cosh(^)]7f(m')} } 



-51og(e) 



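It remains now only to replace in the right-hand side of this inequality $\overline{\mu}$ with
the worst possible posterior distribution to obtain

Theorem 2.3.9. Let $\lambda > \gamma > \beta$, $\zeta$, $\alpha$ and $\xi$ be arbitrary positive real constants.
Let us use the notation $\overline{\pi} = \pi_{\exp(-\beta R)}$, $\widetilde{\pi} = \pi_{\exp(-N\tanh(\frac{\gamma}{N})R)}$, $\widehat{\pi} = \pi_{\exp(-\gamma r)}$,
$\overline{\mu} = \mu_{\exp[-\alpha\overline{\pi}(R) - \xi\,\widetilde{\pi} \otimes \widetilde{\pi}(M')]}$, and let us define the posterior distribution $\widehat{\mu} : \Omega \to \mathcal{M}_+^1(M)$
by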



f aA ^. . 

— <~ cxp< 7r(r) 

d// I iVtanh(A) V ; 



Let us assume moreover that 
a 



+ 



N 



— — log}?? <g> 9 [exp(-Cm')] } 1 • 

[exp(i) - 1J L J J 



+ 



+ 



5£ 



iVtanh(^)+/? /Vtanh(A)(l - ?) iVtanh(^) 



< 1. 



With $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution $\nu : \Omega \to \mathcal{M}_+^1(M)$,

- 1 1 " A^h^yr^ 



5£ 



+ 



7Vtanh(A)(l-2) tftanh(£) : 

log|7f [exp{7Vlog[cosh(^-)]7f(m')} } 



+ 



iVtanh(^)+/3 
a 

7Vtanh(A)(l_2; 



+ 



7Vtanh(-£) 2 
+ 



3i/ 



log{ 5r [exp { JV J log [cosh( A)]7f(m')}]} 
logj 7f <g> tt |exp { 2iV log [cosh( )] rri } J | 
— — j^jlog 7? ® 7?[exp(-Cm')] J j 



^V[exp(^) 

+ log|/x {7?[exp{7Vlog[cosh(-^)]7r(m')} | 



Ntanh(-^.)-0 



{ 7 r[exp{iVjlog[cosh(A)] 7? ( m ')}]}~(^)(-i) 
x {?f[cxp{2iVlog[cosh(^)]7T(m')}]}™^ P< ^ ) " 1 J 
x exp



{jVlog[cosh(A)] [(^vr) (g) 7f] (to')} 



2a 1 



A' / 



iVtanh(-X)+/3 iVtanh($)-/3 7Vtanh(A)(i - 2) 
+ 7Vtanh(^)2 + iV [exp(i)-l] 



The interest of this theorem lies in the presence of a variance term in the localized
posterior distribution $\widehat{\mu}$, which with a suitable choice of parameters seems to be an
interesting option in the case when there are nested models: in this situation there
may be a need to prevent integration with respect to $\mu$ in the right-hand side to
put weight on wild oversized models with large variance terms. Moreover, the
right-hand side being empirical, parameters can be, as usual, optimized from data using
a union bound on a grid of candidate values.
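
The union bound argument can be sketched as follows: if a bound holds with probability at least $1 - \epsilon$ for each fixed parameter value, then evaluating it at confidence level $\epsilon / K$ for each of $K$ grid values makes all of them hold simultaneously with probability at least $1 - \epsilon$, so the smallest one may be kept. The Python sketch below is only an illustration: `toy_bound` is a hypothetical stand-in with a typical PAC-Bayesian shape, not the bound of the theorem, and the grid and constants are arbitrary.

```python
import math


def union_bound_over_grid(bound, grid, eps):
    """Minimize an empirical bound over a finite grid of parameter values.

    `bound(param, eps)` is assumed to hold with probability >= 1 - eps for
    each fixed `param`. Each grid point is evaluated at level eps / len(grid),
    so that all bounds hold together with probability >= 1 - eps.
    """
    eps_each = eps / len(grid)
    return min((bound(param, eps_each), param) for param in grid)


def toy_bound(lam, eps, n=1000, emp_error=0.10, divergence=5.0):
    """Hypothetical stand-in: empirical error + complexity / lam + lam / (8 n)."""
    return emp_error + (divergence + math.log(1.0 / eps)) / lam + lam / (8.0 * n)


grid = [2 ** j for j in range(4, 12)]   # geometric grid of candidate values
best_value, best_lam = union_bound_over_grid(toy_bound, grid, eps=0.05)
print(best_value, best_lam)
```

With a geometric grid the price of the optimization is only a term of order the logarithm of the grid size inside the confidence term.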

If one is only interested in the general shape of the result, a simplified inequality
such as the one below may suffice:

Corollary 2.3.10. For any positive real constants $\lambda > \gamma > \beta$, $\zeta$, $\alpha$ and $\xi$, let us
use the same notation as in Theorem 2.3.9 (page 108). Let us put moreover



Ax 
A 2 
A 3 
A 4 
A, 








a 


iVtanh^)-! 




iVtanh(A)(l_2) 


Q 




a 


iVtanh(^H 




iVtanh(A)(l_2) 


t 








-1] 




a 




a 


TVtanh(^) - 




iVtanh(A)(i_2) 


a 




a 


iVtanh(^)-) 


V 


TVtanh(-X) - (3 






+ ^ 


+ N tanh( 


7 \2 
N> 


ATfexp^) - l] ' 



5£ 



_7 \2 ■ 



+ 



3£ 



2i 



2a(l + #) 



« N > 



C 1 = 2iVlog[cosh(A)] ) 
C 2 =JVlog[cosh(A)]. 

Let us assume that $A_1 < 1$. With $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior
distribution $\nu : \Omega \to \mathcal{M}_+^1(M)$,



DC(i/, m) < K(v, a, /3, 7, A, C, e) = (l - Aj) 



Aoy 



log ^7? (£i 7? [exp (Ci to') ] 



log<^ ju 



7T ^exp[Cl7?(TO')] 



log^7r ® if [exp(— Qm 1 )] 
(C 2 [(z/tt) ® 5r] (to') 

+ A 5 log(|) 



exp 
Putting this corollary together with Corollary 2.3.5 (page 102), we obtain

Theorem 2.3.11. Let us consider the notation introduced in Corollary 2.3.5 (page
102) and in Theorem 2.3.9 (page 108) and its Corollary 2.3.10 (page 109). Let us
consider real positive parameters $\lambda$, $\gamma_1' < \lambda_1'$ and $\gamma_2' < \lambda_2'$. Let us consider also two
sets of parameters $\alpha_i$, $\beta_i$, $\gamma_i$, $\lambda_i$, $\zeta_i$, where $i = 1, 2$, both satisfying the conditions
stated in Corollary 2.3.10 (page 109). With $\mathbb{P}$ probability at least $1 - \epsilon$, for any
posterior distributions $\nu_1, \nu_2 : \Omega \to \mathcal{M}_+^1(M)$, any conditional posterior distributions
$\rho_1, \rho_2 : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,



This theorem provides, using a union bound argument to further optimize the
parameters, an empirical bound for $\nu_1\rho_1(R) - \nu_2\rho_2(R)$ which can serve to build
a selection algorithm exactly in the same way as what was done in Theorem 2.2.4
(page 75). This represents the highest degree of sophistication that we will achieve
in this monograph, as far as model selection is concerned: this theorem shows that 
it is indeed possible to derive a selection scheme in which localization is performed 
in two steps and in which the localization of the model selection itself, as opposed 
to the localization of the estimation in each model, includes a variance term as well 
as a bias term, so that it should be possible to localize the choice of nested mod- 
els, something that would not have been feasible with the localization techniques 
exposed in the previous sections of this study. We should point out however that 
more sophisticated does not necessarily mean more efficient: as the reader may 
have noticed, sophistication comes at a price, in terms of the complexity of the 
estimation schemes, with some possible loss of accuracy in the constants that can 
mar the benefits of using an asymptotically more efficient method for small sample 
sizes. 

We will do the hurried reader a favour: we will not launch into a study of the 
theoretical properties of this selection algorithm, although it is clear that all the 
tools needed are at hand! 

We would like as a conclusion to this chapter, to put forward a simple idea: 
this approach of model selection revolves around entropy estimates concerned with 
the divergence of posterior distributions with respect to localized prior distribu- 
tions. Moreover, this localization of the prior distribution is more effectively done 
in several steps in some situations, and it is worth mentioning that these situations 
include the typical case of selection from a family of parametric models. Finally, 
the whole story relies upon estimating the relative generalization error rate of one 
posterior distribution with respect to some local prior distribution as well as with 
respect to another posterior distribution, because these relative rates can be esti- 
mated more accurately than absolute generalization error rates, at least as soon 
as no classification model of reasonable size provides a good match to the training 
sample, meaning that the classification problem is either difficult or noisy. 




1 



-K{v 2l a2, 13 2 , 12, \2, £,2X2, f) - log(f). 
Chapter 3

Transductive PAC-Bayesian learning

3.1. Basic inequalities

3.1.1. The transductive setting

In this chapter the observed sample $(X_i, Y_i)_{i=1}^{N}$ will be supplemented with a test or
shadow sample $(X_i, Y_i)_{i=N+1}^{(k+1)N}$. This point of view, called transductive classification,
has been introduced by V. Vapnik. It may be justified in different ways.

On the practical side, one interest of the transductive setting is that it is often a 
lot easier to collect examples than it is to label them, so that it is not unrealistic to 
assume that we indeed have two training samples, one labelled and one unlabelled.
It also covers the case when a batch of patterns is to be classified and we are allowed 
to observe the whole batch before issuing the classification. 

On the mathematical side, considering a shadow sample proves technically fruit- 
ful. Indeed, when introducing the Vapnik-Cervonenkis entropy and Vapnik-Cervo- 
nenkis dimension concepts, as well as when dealing with compression schemes, albeit 
the inductive setting is our final concern, the transductive setting is a useful detour. 
In this second scenario, intermediate technical results involving the shadow sample 
are integrated with respect to unobserved random variables in a second stage of the 
proofs. 

Let us describe now the changes to be made to previous notation to adapt them
to the transductive setting. The distribution $\mathbb{P}$ will be a probability measure on the
canonical space $\Omega = (\mathcal{X} \times \mathcal{Y})^{(k+1)N}$, and $(X_i, Y_i)_{i=1}^{(k+1)N}$ will be the canonical process
on this space (that is the coordinate process). Unless explicitly mentioned, the
parameter $k$ indicating the size of the shadow sample will remain fixed. Assuming
the shadow sample size is a multiple of the training sample size is convenient without
significantly restricting generality. For a while, we will use a weaker assumption than
independence, assuming that $\mathbb{P}$ is partially exchangeable, since this is all we need in
the proofs.

Definition 3.1.1. For $i = 1, \dots, N$, let $\tau_i : \Omega \to \Omega$ be defined for any
$\omega = (\omega_j)_{j=1}^{(k+1)N} \in \Omega$ by

$$\begin{aligned}
\tau_i(\omega)_{i+jN} &= \omega_{i+(j-1)N}, \qquad j = 1, \dots, k, \\
\tau_i(\omega)_i &= \omega_{i+kN}, \\
\text{and}\quad \tau_i(\omega)_{m+jN} &= \omega_{m+jN}, \qquad m \ne i,\ m = 1, \dots, N,\ j = 0, \dots, k.
\end{aligned}$$

Clearly, if we arrange the $(k+1)N$ samples in a $N \times (k+1)$ array, $\tau_i$ performs
a circular permutation of $k+1$ entries on the $i$th row, leaving the other rows
unchanged. Moreover, all the circular permutations of the $i$th row have the form $\tau_i^j$, $j$
ranging from $0$ to $k$.
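
The action of the circular permutation on the array can be made concrete with a short Python sketch. It stores the sample as an array of shape `(N, k + 1)` whose entry `[i, j]` holds the observation of index `i + j * N` (with 0-based `i`), a layout chosen here for illustration; the function name `tau` and the sizes are likewise illustrative.

```python
import numpy as np


def tau(omega, i):
    """Circular permutation of row i of an (N, k + 1) sample array.

    Column j of the result receives column j - 1 of the input, and
    column 0 receives column k, as in the definition; other rows are kept.
    """
    out = omega.copy()
    out[i] = np.roll(omega[i], 1)
    return out


N, k = 3, 2
# Entry [i, j] holds the (0-based) sample index i + j * N.
omega = np.arange(N * (k + 1)).reshape(k + 1, N).T

once = tau(omega, 1)

# Applying the permutation k + 1 times gives back the original array.
cycled = omega
for _ in range(k + 1):
    cycled = tau(cycled, 1)

print(omega)
print(once)
```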

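The probability distribution $\mathbb{P}$ is said to be partially exchangeable if for any $i =
1, \dots, N$, $\mathbb{P} \circ \tau_i^{-1} = \mathbb{P}$.

This means equivalently that for any bounded measurable function $h : \Omega \to \mathbb{R}$,
$\mathbb{P}(h \circ \tau_i) = \mathbb{P}(h)$.

In the same way a function $h$ defined on $\Omega$ will be said to be partially exchangeable
if $h \circ \tau_i = h$ for any $i = 1, \dots, N$. Accordingly a posterior distribution $\rho : \Omega \to
\mathcal{M}_+^1(\Theta, \mathcal{T})$ will be said to be partially exchangeable when $\rho(\omega, A) = \rho[\tau_i(\omega), A]$, for
any $\omega \in \Omega$, any $i = 1, \dots, N$ and any $A \in \mathcal{T}$.

For any bounded measurable function $h$, let us define $T_i(h) = \frac{1}{k+1}\sum_{j=0}^{k} h \circ \tau_i^j$.
Let $T(h) = T_N \circ \dots \circ T_1(h)$. For any partially exchangeable probability distribution
$\mathbb{P}$, and for any bounded measurable function $h$, $\mathbb{P}[T(h)] = \mathbb{P}(h)$. Let us put

$$\begin{aligned}
\sigma_i(\theta) &= \mathbb{1}\bigl[f_\theta(X_i) \ne Y_i\bigr], && \text{indicating the success or failure of } f_\theta \text{ to predict } Y_i \text{ from } X_i, \\
r_1(\theta) &= \frac{1}{N}\sum_{i=1}^{N}\sigma_i(\theta), && \text{the empirical error rate of } f_\theta \text{ on the observed sample,} \\
r_2(\theta) &= \frac{1}{kN}\sum_{i=N+1}^{(k+1)N}\sigma_i(\theta), && \text{the error rate of } f_\theta \text{ on the shadow sample,} \\
\overline{r}(\theta) &= \frac{r_1(\theta) + k\,r_2(\theta)}{k+1} = \frac{1}{(k+1)N}\sum_{i=1}^{(k+1)N}\sigma_i(\theta), && \text{the global error rate of } f_\theta, \\
R_i(\theta) &= \mathbb{P}\bigl[f_\theta(X_i) \ne Y_i\bigr], && \text{the expected error rate of } f_\theta \text{ on the } i\text{th input,} \\
R(\theta) &= \frac{1}{N}\sum_{i=1}^{N}R_i(\theta) = \mathbb{P}\bigl[r_1(\theta)\bigr] = \mathbb{P}\bigl[r_2(\theta)\bigr], && \text{the average expected error rate of } f_\theta \text{ on all inputs.}
\end{aligned}$$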
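
The three empirical error rates can be computed from the vector of error indicators alone. The following Python sketch uses a hypothetical indicator vector of length $(k+1)N$, whose first $N$ entries belong to the observed sample, and checks that the global rate is the stated weighted average of the other two; the function name `error_rates` is chosen here for illustration.

```python
import numpy as np


def error_rates(sigma, n_observed):
    """Return (r1, r2, r_bar, k) from the error indicators of all samples.

    sigma[:n_observed] are the indicators on the observed sample and
    sigma[n_observed:] those on the shadow sample, of size k * n_observed.
    """
    sigma = np.asarray(sigma, dtype=float)
    k = sigma.size // n_observed - 1
    r1 = float(sigma[:n_observed].mean())
    r2 = float(sigma[n_observed:].mean())
    r_bar = float(sigma.mean())
    return r1, r2, r_bar, k


# Hypothetical indicators with N = 4 observed samples and k = 2.
sigma = [1, 0, 0, 1,  0, 1, 0, 0,  1, 1, 0, 0]
r1, r2, r_bar, k = error_rates(sigma, n_observed=4)
print(r1, r2, r_bar, k)
```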

We will allow for posterior distributions $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$ depending on the shadow
sample. The most interesting ones will anyhow be independent of the shadow labels
$Y_{N+1}, \dots, Y_{(k+1)N}$. We will be interested in the conditional expected error rate of
the randomized classification rule described by $\rho$ on the shadow sample, given the
observed sample, that is, $\mathbb{P}\bigl[\rho(r_2) \mid (X_i, Y_i)_{i=1}^{N}\bigr]$. This is a natural extension of the
notion of generalization error rate: this is indeed the error rate to be expected
when the randomized classification rule described by the posterior distribution $\rho$
is applied to the shadow sample (which should in this case more purposefully be
called the test sample).






To see the connection with the previously defined generalization error rate, let us
comment on the case when $\mathbb{P}$ is invariant by any permutation of any row, meaning
that

$$\mathbb{P}[h(\omega \circ s)] = \mathbb{P}[h(\omega)] \quad \text{for all } s \in \mathfrak{S}(\{i + jN;\ j = 0, \dots, k\})$$

and all $i = 1, \dots, N$, where $\mathfrak{S}(A)$ is the set of permutations of $A$, extended to
$\{1, \dots, (k+1)N\}$ so as to be the identity outside of $A$. In other words, $\mathbb{P}$ is
assumed to be invariant under any permutation which keeps the rows unchanged.
In this case, if $\rho$ is invariant by any permutation of any row of the shadow
sample, meaning that $\rho(\omega \circ s) = \rho(\omega) \in \mathcal{M}_+^1(\Theta)$, $s \in \mathfrak{S}(\{i + jN;\ j = 1, \dots, k\})$,
$i = 1, \dots, N$, then $\mathbb{P}\bigl[\rho(r_2) \mid (X_i, Y_i)_{i=1}^{N}\bigr] = \frac{1}{N}\sum_{i=1}^{N}\mathbb{P}\bigl[\rho(\sigma_{i+N}) \mid (X_i, Y_i)_{i=1}^{N}\bigr]$, meaning
that the expectation can be taken on a restricted shadow sample of the same size as
the observed sample. If moreover the rows are equidistributed, meaning that their
marginal distributions are equal, then

$$\mathbb{P}\bigl[\rho(r_2) \mid (X_i, Y_i)_{i=1}^{N}\bigr] = \mathbb{P}\bigl[\rho(\sigma_{N+1}) \mid (X_i, Y_i)_{i=1}^{N}\bigr].$$

This means that under these quite commonly fulfilled assumptions, the expectation
can be taken on a single new object to be classified; our study thus covers the case
when only one of the patterns from the shadow sample is to be labelled and one is
interested in the expected error rate of this single labelling. Of course, in the case
when $\mathbb{P}$ is i.i.d. and $\rho$ depends only on the training sample $(X_i, Y_i)_{i=1}^{N}$, we fall back
on the usual criterion of performance $\mathbb{P}\bigl[\rho(r_2) \mid (X_i, Y_i)_{i=1}^{N}\bigr] = \rho(R) = \rho(R_1)$.



3.1.2. Absolute bound 



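Using an obvious factorization, and considering for the moment a fixed value of $\theta$ and any partially exchangeable positive real measurable function $\lambda : \Omega \to \mathbb{R}_+$, we can compute the log-Laplace transform of $r_1$ under $T$, which acts like a conditional probability distribution:

$$
\log\bigl\{T[\exp(-\lambda r_1)]\bigr\} = \sum_{i=1}^{N}\log\Bigl\{T_i\bigl[\exp\bigl(-\tfrac{\lambda}{N}\sigma_i\bigr)\bigr]\Bigr\}
\le N\log\Bigl\{\frac{1}{N}\sum_{i=1}^{N}T_i\bigl[\exp\bigl(-\tfrac{\lambda}{N}\sigma_i\bigr)\bigr]\Bigr\}
= -\lambda\,\Phi_{\frac{\lambda}{N}}(\overline{r}),
$$

where the function $\Phi_{\frac{\lambda}{N}}$ was defined by equation (1.1, page 2). Remarking that
$T\bigl\{\exp\bigl[\lambda\bigl(\Phi_{\frac{\lambda}{N}}(\overline{r}) - r_1\bigr)\bigr]\bigr\} = \exp\bigl[\lambda\Phi_{\frac{\lambda}{N}}(\overline{r})\bigr]\,T[\exp(-\lambda r_1)]$ we obtain

Lemma 3.1.1. For any $\theta \in \Theta$ and any partially exchangeable positive real measurable function $\lambda : \Omega \to \mathbb{R}_+$,

$$T\Bigl\{\exp\Bigl[\lambda\bigl\{\Phi_{\frac{\lambda}{N}}[\overline{r}(\theta)] - r_1(\theta)\bigr\}\Bigr]\Bigr\} \le 1.$$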

We deduce from this lemma a result analogous to the inductive case: 



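Theorem 3.1.2. For any partially exchangeable positive real measurable function $\lambda : \Omega \times \Theta \to \mathbb{R}_+$, for any partially exchangeable posterior distribution $\pi : \Omega \to \mathcal{M}^1_+(\Theta)$,

$$\mathbb{P}\Bigl\{\exp\Bigl[\sup_{\rho \in \mathcal{M}^1_+(\Theta)} \rho\bigl[\lambda\bigl(\Phi_{\frac{\lambda}{N}}(\overline{r}) - r_1\bigr)\bigr] - \mathcal{K}(\rho, \pi)\Bigr]\Bigr\} \le 1.$$

The proof is deduced from the previous lemma, using the fact that $\pi$ is partially exchangeable:

$$
\begin{aligned}
\mathbb{P}\Bigl\{\exp\Bigl[\sup_{\rho \in \mathcal{M}^1_+(\Theta)} \rho\bigl[\lambda\bigl(\Phi_{\frac{\lambda}{N}}(\overline{r}) - r_1\bigr)\bigr] - \mathcal{K}(\rho, \pi)\Bigr]\Bigr\}
&= \mathbb{P}\Bigl\{\pi\Bigl\{\exp\bigl[\lambda\bigl(\Phi_{\frac{\lambda}{N}}(\overline{r}) - r_1\bigr)\bigr]\Bigr\}\Bigr\}
= \mathbb{P}\Bigl\{T\pi\Bigl\{\exp\bigl[\lambda\bigl(\Phi_{\frac{\lambda}{N}}(\overline{r}) - r_1\bigr)\bigr]\Bigr\}\Bigr\}\\
&= \mathbb{P}\Bigl\{\pi\Bigl\{T\exp\bigl[\lambda\bigl(\Phi_{\frac{\lambda}{N}}(\overline{r}) - r_1\bigr)\bigr]\Bigr\}\Bigr\} \le 1.
\end{aligned}
$$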



3.1.3. Relative bounds 



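Introducing in the same way

$$m'(\theta, \theta') = \frac{1}{N}\sum_{i=1}^{N}\Bigl|\mathbb{1}[f_\theta(X_i) \neq Y_i] - \mathbb{1}[f_{\theta'}(X_i) \neq Y_i]\Bigr|$$

$$\text{and}\quad \overline{m}(\theta, \theta') = \frac{1}{(k+1)N}\sum_{i=1}^{(k+1)N}\Bigl|\mathbb{1}[f_\theta(X_i) \neq Y_i] - \mathbb{1}[f_{\theta'}(X_i) \neq Y_i]\Bigr|,$$

we could prove along the same line of reasoning

Theorem 3.1.3. For any real parameter $\lambda$, any $\widetilde{\theta} \in \Theta$, any partially exchangeable posterior distribution $\pi : \Omega \to \mathcal{M}^1_+(\Theta)$,

$$\mathbb{P}\Bigl\{\exp\Bigl[\sup_{\rho \in \mathcal{M}^1_+(\Theta)} \lambda\Bigl[\rho\bigl\{\Psi_{\frac{\lambda}{N}}\bigl[\overline{r}(\cdot) - \overline{r}(\widetilde{\theta}),\ \overline{m}(\cdot, \widetilde{\theta})\bigr]\bigr\} - \bigl[\rho(r_1) - r_1(\widetilde{\theta})\bigr]\Bigr] - \mathcal{K}(\rho, \pi)\Bigr]\Bigr\} \le 1,$$

where the function $\Psi_{\frac{\lambda}{N}}$ was defined by equation (1.21, page 35).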



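Theorem 3.1.4. For any real constant $\gamma$, for any $\widetilde{\theta} \in \Theta$, for any partially exchangeable posterior distribution $\pi : \Omega \to \mathcal{M}^1_+(\Theta)$,

$$\mathbb{P}\Bigl\{\exp\Bigl[\sup_{\rho \in \mathcal{M}^1_+(\Theta)}\Bigl\{-N\rho\Bigl\{\log\Bigl[1 - \tanh\bigl(\tfrac{\gamma}{N}\bigr)\bigl[\overline{r}(\cdot) - \overline{r}(\widetilde{\theta})\bigr]\Bigr]\Bigr\} - \gamma\bigl[\rho(r_1) - r_1(\widetilde{\theta})\bigr] - N\log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]\rho\bigl[m'(\cdot, \widetilde{\theta})\bigr] - \mathcal{K}(\rho, \pi)\Bigr\}\Bigr]\Bigr\} \le 1.$$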



This last theorem can be generalized to give 

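Theorem 3.1.5. For any real constant $\gamma$, for any partially exchangeable posterior distributions $\pi^1, \pi^2 : \Omega \to \mathcal{M}^1_+(\Theta)$,

$$\mathbb{P}\Bigl\{\exp\Bigl[\sup_{\rho_1, \rho_2 \in \mathcal{M}^1_+(\Theta)}\Bigl\{-N\log\Bigl\{1 - \tanh\bigl(\tfrac{\gamma}{N}\bigr)\bigl[\rho_1(\overline{r}) - \rho_2(\overline{r})\bigr]\Bigr\} - \gamma\bigl[\rho_1(r_1) - \rho_2(r_1)\bigr] - N\log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]\rho_1 \otimes \rho_2(m') - \mathcal{K}(\rho_1, \pi^1) - \mathcal{K}(\rho_2, \pi^2)\Bigr\}\Bigr]\Bigr\} \le 1.$$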

To conclude this section, we see that the basic theorems of transductive PAC-Bayesian classification have exactly the same form as the basic inequalities of inductive classification, Theorems 1.1.4, 1.4.2 and 1.4.3 (page 37), with $R(\theta)$ replaced with $\overline{r}(\theta)$, $r(\theta)$ replaced with $r_1(\theta)$ and $M'(\theta, \widetilde{\theta})$ replaced with $\overline{m}(\theta, \widetilde{\theta})$.

Thus all the results of the first two chapters remain true under the hypotheses of transductive classification, with $R(\theta)$ replaced with $\overline{r}(\theta)$, $r(\theta)$ replaced with $r_1(\theta)$ and $M'(\theta, \widetilde{\theta})$ replaced with $\overline{m}(\theta, \widetilde{\theta})$.

Consequently, in the case when the unlabelled shadow sample is observed, it is possible to improve on the Vapnik bounds to be discussed hereafter by using an explicit partially exchangeable posterior distribution $\pi$ and resorting to localized or to relative bounds (in the case at least of unlimited computing resources, which of course may still be unrealistic in many real world situations, and with the caveat, to be recalled in the conclusion of this study, that for small sample sizes and comparatively complex classification models, the improvement may not be so decisive).

Let us notice also that the transductive setting, when experimentally available, has the advantage that

$$\overline{d}(\theta, \theta') = \frac{1}{(k+1)N}\sum_{i=1}^{(k+1)N}\mathbb{1}[f_\theta(X_i) \neq f_{\theta'}(X_i)] \ge \overline{m}(\theta, \theta') \ge \overline{r}(\theta) - \overline{r}(\theta'), \qquad \theta, \theta' \in \Theta,$$

is observable in this context, providing an empirical upper bound for the difference $\overline{r}(\widehat{\theta}) - \rho(\overline{r})$ for any non-randomized estimator $\widehat{\theta}$ and any posterior distribution $\rho$, namely

$$\overline{r}(\widehat{\theta}) \le \rho(\overline{r}) + \rho\bigl[\overline{d}(\cdot, \widehat{\theta})\bigr].$$

Thus in the setting of transductive statistical experiments, the PAC-Bayesian framework provides fully empirical bounds for the error rate of non-randomized estimators $\widehat{\theta} : \Omega \to \Theta$, even when using a non-atomic prior $\pi$ (or more generally a non-atomic partially exchangeable posterior distribution $\pi$), even when $\Theta$ is not a vector space and even when $\theta \mapsto R(\theta)$ cannot be proved to be convex on the support of some useful posterior distribution $\rho$.

3.2. Vapnik bounds for transductive classification 

In this section, we will stick to plain unlocalized non-relative bounds. As we have 
already mentioned, (and as it was put forward by Vapnik himself in his seminal 
works), these bounds are not always superseded by the asymptotically better ones 
when the sample is of small size: they deserve all our attention for this reason. We 
will start with the general case of a shadow sample of arbitrary size. We will then 
discuss the case of a shadow sample of equal size to the training set and the case of 
a fully exchangeable sample distribution, showing how they can be taken advantage 
of to sharpen inequalities. 
3.2.1. With a shadow sample of arbitrary size



The great thing with the transductive setting is that we are manipulating only $r_1$ and $\overline{r}$, which can take only a finite number of values and therefore are piecewise constant on $\Theta$. This makes it possible to derive inequalities that will hold uniformly for any value of the parameter $\theta \in \Theta$. To this purpose, let us consider for any value $\theta \in \Theta$ of the parameter the subset $\Delta(\theta) \subset \Theta$ of parameters $\theta'$ such that the classification rule $f_{\theta'}$ answers the same on the extended sample $(X_i)_{i=1}^{(k+1)N}$ as $f_\theta$. Namely, let us put for any $\theta \in \Theta$

$$\Delta(\theta) = \bigl\{\theta' \in \Theta;\ f_{\theta'}(X_i) = f_\theta(X_i),\ i = 1, \dots, (k+1)N\bigr\}.$$

We see immediately that $\Delta(\theta)$ is an exchangeable parameter subset on which $r_1$ and $r_2$ and therefore also $\overline{r}$ take constant values. Thus for any $\theta \in \Theta$ we may consider the posterior $\rho_\theta$ defined by

$$\frac{d\rho_\theta}{d\pi}(\theta') = \frac{\mathbb{1}[\theta' \in \Delta(\theta)]}{\pi[\Delta(\theta)]},$$

and use the fact that $\rho_\theta(r_1) = r_1(\theta)$ and $\rho_\theta(\overline{r}) = \overline{r}(\theta)$, to prove that



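Lemma 3.2.1. For any partially exchangeable positive real measurable function $\lambda : \Omega \times \Theta \to \mathbb{R}_+$ such that

$$(3.1)\qquad \lambda(\omega, \theta') = \lambda(\omega, \theta), \qquad \theta \in \Theta,\ \theta' \in \Delta(\theta),\ \omega \in \Omega,$$

and any partially exchangeable posterior distribution $\pi : \Omega \to \mathcal{M}^1_+(\Theta)$, with $\mathbb{P}$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$\Phi_{\frac{\lambda}{N}}[\overline{r}(\theta)] + \frac{\log\{\varepsilon\pi[\Delta(\theta)]\}}{\lambda} \le r_1(\theta).$$

We can then remark that for any value of $\lambda$ independent of $\omega$, the left-hand side of the previous inequality is a partially exchangeable function of $\omega \in \Omega$. Thus this left-hand side is maximized by some partially exchangeable function $\lambda$, namely

$$\arg\max_{\lambda}\Bigl\{\Phi_{\frac{\lambda}{N}}[\overline{r}(\theta)] + \frac{\log\{\varepsilon\pi[\Delta(\theta)]\}}{\lambda}\Bigr\}$$

is partially exchangeable as depending only on partially exchangeable quantities. Moreover this choice of $\lambda(\omega, \theta)$ satisfies also condition (3.1) stated in the previous lemma of being constant on $\Delta(\theta)$, proving

Lemma 3.2.2. For any partially exchangeable posterior distribution $\pi : \Omega \to \mathcal{M}^1_+(\Theta)$, with $\mathbb{P}$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$ and any $\lambda \in \mathbb{R}_+$,

$$\Phi_{\frac{\lambda}{N}}[\overline{r}(\theta)] + \frac{\log\{\varepsilon\pi[\Delta(\theta)]\}}{\lambda} \le r_1(\theta).$$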

Writing $\overline{r} = \frac{r_1 + k r_2}{k+1}$ and rearranging terms we obtain

Theorem 3.2.3. For any partially exchangeable posterior distribution $\pi : \Omega \to \mathcal{M}^1_+(\Theta)$, with $\mathbb{P}$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$r_2(\theta) \le \frac{k+1}{k}\inf_{\lambda \in \mathbb{R}_+}\frac{1 - \exp\Bigl(-\frac{\lambda}{N}r_1(\theta) + \frac{\log\{\varepsilon\pi[\Delta(\theta)]\}}{N}\Bigr)}{1 - \exp\bigl(-\frac{\lambda}{N}\bigr)} - \frac{r_1(\theta)}{k}.$$

If we have a set of binary classification rules $\{f_\theta;\ \theta \in \Theta\}$ whose Vapnik-Cervonenkis dimension is not greater than $h$, we can choose $\pi$ such that $\pi[\Delta(\theta)]$ is independent of $\theta$ and not less than $\Bigl(\frac{h}{e(k+1)N}\Bigr)^{h}$, as will be proved further on in Theorem […] (page […]).

Another important setting where the complexity term $-\log\{\pi[\Delta(\theta)]\}$ can easily be controlled is the case of compression schemes, introduced by Little et al. (1986). It goes as follows: we are given for each labelled sub-sample $(X_i, Y_i)_{i \in J}$, $J \subset \{1, \dots, N\}$, an estimator of the parameter

$$\widehat{\theta}\bigl[(X_i, Y_i)_{i \in J}\bigr] = \widehat{\theta}_J, \qquad J \subset \{1, \dots, N\},\ |J| \le h,$$

where

$$\widehat{\theta} : \bigsqcup_{k=1}^{N}(\mathcal{X} \times \mathcal{Y})^k \to \Theta$$

is an exchangeable function providing estimators for sub-samples of arbitrary size. Let us assume that $\widehat{\theta}$ is exchangeable, meaning that for any $k = 1, \dots, N$ and any permutation $\sigma$ of $\{1, \dots, k\}$

$$\widehat{\theta}\bigl[(x_i, y_i)_{i=1}^{k}\bigr] = \widehat{\theta}\bigl[(x_{\sigma(i)}, y_{\sigma(i)})_{i=1}^{k}\bigr], \qquad (x_i, y_i)_{i=1}^{k} \in (\mathcal{X} \times \mathcal{Y})^k.$$

In this situation, we can introduce the exchangeable subset

$$\bigl\{\widehat{\theta}_J;\ J \subset \{1, \dots, (k+1)N\},\ |J| \le h\bigr\} \subset \Theta,$$

which is seen to contain at most

$$\sum_{j=0}^{h}\binom{(k+1)N}{j} \le \Bigl(\frac{e(k+1)N}{h}\Bigr)^{h}$$

classification rules, as will be proved later on in Theorem 4.2.3 (page 144). Note that we had to extend the range of $J$ to all the subsets of the extended sample, although we will use for estimation only those of the training sample, on which the labels are observed. Thus in this case also we can find a partially exchangeable posterior distribution $\pi$ such that

$$\pi[\Delta(\widehat{\theta}_J)] \ge \Bigl(\frac{h}{e(k+1)N}\Bigr)^{h}.$$

We see that the size of the compression scheme plays the same role in this complexity bound as the Vapnik-Cervonenkis dimension for Vapnik-Cervonenkis classes.

In these two cases of binary classification with Vapnik-Cervonenkis dimension not greater than $h$ and compression schemes depending on a compression set with at most $h$ points, we get a bound of

$$r_2(\theta) \le \frac{k+1}{k}\inf_{\lambda \in \mathbb{R}_+}\frac{1 - \exp\Bigl[-\frac{\lambda}{N}r_1(\theta) - \frac{h}{N}\log\Bigl(\frac{e(k+1)N}{h}\Bigr) + \frac{\log(\varepsilon)}{N}\Bigr]}{1 - \exp\bigl(-\frac{\lambda}{N}\bigr)} - \frac{r_1(\theta)}{k}.$$

Let us make some numerical application: when $N = 1000$, $h = 10$, $\varepsilon = 0.01$, and $\inf_\Theta r_1 = r_1(\widehat{\theta}) = 0.2$, we find that $r_2(\widehat{\theta}) \le 0.4093$, for $k$ between 15 and 17, and values of $\lambda$ equal respectively to 965, 968 and 971. For $k = 1$, we find only $r_2(\widehat{\theta}) \le 0.539$, showing the interest of allowing $k$ to be larger than 1.
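
The numerical application above can be reproduced approximately with a short script. The following is a minimal sketch, not the author's computation: it assumes the bound $r_2 \le \frac{k+1}{k}\inf_\lambda \frac{1 - \exp(-\lambda r_1/N - d/N)}{1 - \exp(-\lambda/N)} - \frac{r_1}{k}$ with $d = h\log\bigl(e(k+1)N/h\bigr) - \log\varepsilon$, and it replaces the infimum over $\lambda \in \mathbb{R}_+$ by a minimum over an integer grid. The function name `transductive_bound` and the grid are illustrative choices.

```python
import math

def transductive_bound(r1, N, h, eps, k, lambdas):
    """Upper bound on the shadow-sample error rate r2 for a shadow sample of size k*N."""
    # Complexity term -log(eps * pi[Delta(theta)]), using pi[Delta] >= (h / (e (k+1) N))^h.
    d = h * math.log(math.e * (k + 1) * N / h) - math.log(eps)
    best = float("inf")
    for lam in lambdas:
        ratio = (1.0 - math.exp(-(lam / N) * r1 - d / N)) / (1.0 - math.exp(-lam / N))
        best = min(best, ratio)
    # Convert the bound on the global error rate into a bound on r2.
    return (k + 1) / k * best - r1 / k

LAMBDAS = range(1, 5001)  # integer grid standing in for lambda in R_+ (an assumption)
bounds = {k: transductive_bound(0.2, 1000, 10, 0.01, k, LAMBDAS) for k in range(1, 31)}
best_k = min(bounds, key=bounds.get)
print(f"k = 1      : {bounds[1]:.4f}")
print(f"best k = {best_k}: {bounds[best_k]:.4f}")
```

The curve is very flat in $k$ around its minimum, which is consistent with the text reporting the same value for several consecutive values of $k$.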



3.2.2. When the shadow sample has the same size as the training 
sample 



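In the case when $k = 1$, we can improve Theorem 3.1.2 by taking advantage of the fact that $T_i(\sigma_i)$ can take only 3 values, namely 0, 0.5 and 1. We see thus that $T_i(\sigma_i) - \Phi_{\frac{\lambda}{N}}[T_i(\sigma_i)]$ can take only two values, $0$ and $\frac{1}{2} - \Phi_{\frac{\lambda}{N}}(\frac{1}{2})$, because $\Phi_{\frac{\lambda}{N}}(0) = 0$ and $\Phi_{\frac{\lambda}{N}}(1) = 1$. Thus

$$T_i(\sigma_i) - \Phi_{\frac{\lambda}{N}}[T_i(\sigma_i)] = \bigl[1 - |1 - 2T_i(\sigma_i)|\bigr]\bigl[\tfrac{1}{2} - \Phi_{\frac{\lambda}{N}}(\tfrac{1}{2})\bigr].$$

This shows that in the case when $k = 1$,

$$
\begin{aligned}
\log\bigl\{T[\exp(-\lambda r_1)]\bigr\} &= -\lambda\overline{r} + \frac{\lambda}{N}\sum_{i=1}^{N}\Bigl\{T_i(\sigma_i) - \Phi_{\frac{\lambda}{N}}[T_i(\sigma_i)]\Bigr\}\\
&= -\lambda\overline{r} + \frac{\lambda}{N}\bigl[\tfrac{1}{2} - \Phi_{\frac{\lambda}{N}}(\tfrac{1}{2})\bigr]\sum_{i=1}^{N}\bigl[1 - |1 - 2T_i(\sigma_i)|\bigr]\\
&\le -\lambda\overline{r} + \lambda\bigl[\tfrac{1}{2} - \Phi_{\frac{\lambda}{N}}(\tfrac{1}{2})\bigr]\bigl[1 - |1 - 2\overline{r}|\bigr].
\end{aligned}
$$

Noticing that $\frac{1}{2} - \Phi_{\frac{\lambda}{N}}(\frac{1}{2}) = \frac{N}{\lambda}\log\bigl[\cosh\bigl(\frac{\lambda}{2N}\bigr)\bigr]$, we obtain

Theorem 3.2.4. For any partially exchangeable function $\lambda : \Omega \times \Theta \to \mathbb{R}_+$, for any partially exchangeable posterior distribution $\pi : \Omega \to \mathcal{M}^1_+(\Theta)$,

$$\mathbb{P}\Bigl\{\exp\Bigl[\sup_{\rho \in \mathcal{M}^1_+(\Theta)} \rho\Bigl[\lambda(\overline{r} - r_1) - N\log\bigl[\cosh\bigl(\tfrac{\lambda}{2N}\bigr)\bigr]\bigl(1 - |1 - 2\overline{r}|\bigr)\Bigr] - \mathcal{K}(\rho, \pi)\Bigr]\Bigr\} \le 1.$$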
As a consequence, reasoning as previously, we deduce

Theorem 3.2.5. In the case when $k = 1$, for any partially exchangeable posterior distribution $\pi : \Omega \to \mathcal{M}^1_+(\Theta)$, with $\mathbb{P}$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$ and any $\lambda \in \mathbb{R}_+$,

$$\overline{r}(\theta) - \frac{N}{\lambda}\log\bigl[\cosh\bigl(\tfrac{\lambda}{2N}\bigr)\bigr]\bigl(1 - |1 - 2\overline{r}(\theta)|\bigr) + \frac{\log\{\varepsilon\pi[\Delta(\theta)]\}}{\lambda} \le r_1(\theta);$$

and consequently for any $\theta \in \Theta$,

$$r_2(\theta) \le 2\inf_{\lambda \in \mathbb{R}_+}\frac{r_1(\theta) - \frac{\log\{\varepsilon\pi[\Delta(\theta)]\}}{\lambda}}{1 - \frac{2N}{\lambda}\log\bigl[\cosh\bigl(\frac{\lambda}{2N}\bigr)\bigr]} - r_1(\theta).$$

In the case of binary classification using a Vapnik-Cervonenkis class of Vapnik-Cervonenkis dimension not greater than $h$, we can choose $\pi$ such that $-\log\{\pi[\Delta(\theta)]\} \le h\log\bigl(\frac{2eN}{h}\bigr)$ and obtain the following numerical illustration of this theorem: for $N = 1000$, $h = 10$, $\varepsilon = 0.01$ and $\inf_\Theta r_1 = r_1(\widehat{\theta}) = 0.2$, we find an upper bound $r_2(\widehat{\theta}) \le 0.5033$, which improves on Theorem 3.2.3 but still is not under the significance level $\frac{1}{2}$ (achieved by blind random classification). This indicates that considering shadow samples of arbitrary sizes yields, in some noisy situations, a significant improvement on bounds obtained with a shadow sample of the same size as the training sample.
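
As a rough check of this illustration, here is a minimal sketch assuming the bound $r_2 \le 2\inf_\lambda \frac{r_1 + d/\lambda}{1 - \frac{2N}{\lambda}\log\cosh(\lambda/(2N))} - r_1$ with $d = h\log(2eN/h) - \log\varepsilon$. The helper name `equal_size_bound` and the integer grid for $\lambda$ are hypothetical choices made for this example.

```python
import math

def equal_size_bound(r1, N, h, eps, lambdas):
    """Bound on r2 when the shadow sample has the same size as the training sample (k = 1)."""
    # d = -log(eps * pi[Delta(theta)]), with -log(pi[Delta(theta)]) <= h log(2 e N / h).
    d = h * math.log(2 * math.e * N / h) - math.log(eps)
    best = float("inf")
    for lam in lambdas:
        # a = (2N / lambda) * log cosh(lambda / (2N)); always in (0, 1) for lambda > 0.
        a = (2 * N / lam) * math.log(math.cosh(lam / (2 * N)))
        best = min(best, (r1 + d / lam) / (1.0 - a))
    return 2 * best - r1

bound_k1 = equal_size_bound(0.2, 1000, 10, 0.01, range(1, 5001))
print(f"r2 bound for k = 1: {bound_k1:.4f}")
```

The value stays slightly above 0.5, in line with the remark that this bound does not beat blind random classification in this example.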



3.2.3. When moreover the distribution of the augmented sample is 
exchangeable 

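When $k = 1$ and $\mathbb{P}$ is exchangeable, meaning that for any bounded measurable function $h : \Omega \to \mathbb{R}$ and any permutation $s \in \mathfrak{S}(\{1, \dots, 2N\})$, $\mathbb{P}[h(\omega \circ s)] = \mathbb{P}[h(\omega)]$, then we can still improve the bound as follows. Let

$$T'(h) = \frac{1}{N!}\sum_{s \in \mathfrak{S}(\{N+1, \dots, 2N\})} h(\omega \circ s).$$

Then we can write

$$1 - |1 - 2T_i(\sigma_i)| = (\sigma_i - \sigma_{i+N})^2 = \sigma_i + \sigma_{i+N} - 2\sigma_i\sigma_{i+N}.$$

Using this identity, we get for any exchangeable function $\lambda : \Omega \times \Theta \to \mathbb{R}_+$,

$$T\Bigl\{\exp\Bigl[\lambda(\overline{r} - r_1) - \log\bigl[\cosh\bigl(\tfrac{\lambda}{2N}\bigr)\bigr]\sum_{i=1}^{N}\bigl(\sigma_i + \sigma_{i+N} - 2\sigma_i\sigma_{i+N}\bigr)\Bigr]\Bigr\} \le 1.$$

Let us put

$$(3.2)\qquad A(\lambda) = \frac{2N}{\lambda}\log\bigl[\cosh\bigl(\tfrac{\lambda}{2N}\bigr)\bigr],$$

$$(3.3)\qquad v(\theta) = \frac{1}{2N}\sum_{i=1}^{N}\bigl(\sigma_i + \sigma_{i+N} - 2\sigma_i\sigma_{i+N}\bigr).$$

With this notation

$$T\bigl\{\exp\{\lambda[\overline{r} - r_1 - A(\lambda)v]\}\bigr\} \le 1.$$

Let us notice now that

$$T'[v(\theta)] = \overline{r}(\theta) - r_1(\theta)r_2(\theta).$$

Let $\pi : \Omega \to \mathcal{M}^1_+(\Theta)$ be any given exchangeable posterior distribution. Using the exchangeability of $\mathbb{P}$ and $\pi$ and the convexity of the exponential function, we get

$$
\begin{aligned}
\mathbb{P}\bigl\{\pi\exp\{\lambda[\overline{r} - r_1 - A(\overline{r} - r_1r_2)]\}\bigr\}
&= \mathbb{P}\bigl\{\pi\exp\{\lambda[\overline{r} - r_1 - A\,T'(v)]\}\bigr\}\\
&\le \mathbb{P}\bigl\{\pi\,T'\exp\{\lambda[\overline{r} - r_1 - Av]\}\bigr\}
= \mathbb{P}\bigl\{T'\pi\exp\{\lambda[\overline{r} - r_1 - Av]\}\bigr\}\\
&= \mathbb{P}\bigl\{\pi\exp\{\lambda[\overline{r} - r_1 - Av]\}\bigr\}
= \mathbb{P}\bigl\{T\pi\exp\{\lambda[\overline{r} - r_1 - Av]\}\bigr\}\\
&= \mathbb{P}\bigl\{\pi\,T\exp\{\lambda[\overline{r} - r_1 - Av]\}\bigr\} \le 1.
\end{aligned}
$$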
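We are thus ready to state

Theorem 3.2.6. In the case when $k = 1$, for any exchangeable probability distribution $\mathbb{P}$, for any exchangeable posterior distribution $\pi : \Omega \to \mathcal{M}^1_+(\Theta)$, for any exchangeable function $\lambda : \Omega \times \Theta \to \mathbb{R}_+$,

$$\mathbb{P}\Bigl\{\exp\Bigl[\sup_{\rho \in \mathcal{M}^1_+(\Theta)} \rho\bigl\{\lambda\bigl[\overline{r} - r_1 - A(\lambda)(\overline{r} - r_1r_2)\bigr]\bigr\} - \mathcal{K}(\rho, \pi)\Bigr]\Bigr\} \le 1,$$

where $A(\lambda)$ is defined by equation (3.2, page 119).
We then deduce as previously

Corollary 3.2.7. For any exchangeable posterior distribution $\pi : \Omega \to \mathcal{M}^1_+(\Theta)$, for any exchangeable probability measure $\mathbb{P} \in \mathcal{M}^1_+(\Omega)$, for any measurable exchangeable function $\lambda : \Omega \times \Theta \to \mathbb{R}_+$, with $\mathbb{P}$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$\overline{r}(\theta) \le r_1(\theta) + A(\lambda)\bigl[\overline{r}(\theta) - r_1(\theta)r_2(\theta)\bigr] - \frac{\log\{\varepsilon\pi[\Delta(\theta)]\}}{\lambda},$$

where $A(\lambda)$ is defined by equation (3.2, page 119).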



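In order to deduce an empirical bound from this theorem, we have to make some choice for $\lambda(\omega, \theta)$. Fortunately, it is easy to show that the bound holds uniformly in $\lambda$, because the inequality can be rewritten as a function of only one non-exchangeable quantity, namely $r_1(\theta)$. Indeed, since $r_2 = 2\overline{r} - r_1$, we see that the inequality can be written as

$$\overline{r}(\theta) \le r_1(\theta) + A(\lambda)\bigl[\overline{r}(\theta) - 2\overline{r}(\theta)r_1(\theta) + r_1(\theta)^2\bigr] - \frac{\log\{\varepsilon\pi[\Delta(\theta)]\}}{\lambda}.$$

It can be solved in $r_1(\theta)$, to get

$$r_1(\theta) \ge f\bigl(\lambda, \overline{r}(\theta), -\log\{\varepsilon\pi[\Delta(\theta)]\}\bigr),$$

where

$$f(\lambda, \overline{r}, d) = [2A(\lambda)]^{-1}\Bigl\{2\overline{r}A(\lambda) - 1 + \sqrt{[1 - 2\overline{r}A(\lambda)]^2 + 4A(\lambda)\bigl\{\overline{r}[1 - A(\lambda)] - \tfrac{d}{\lambda}\bigr\}}\Bigr\}.$$

Thus we can find some exchangeable function $\lambda(\omega, \theta)$, such that

$$f\bigl(\lambda(\omega, \theta), \overline{r}(\theta), -\log\{\varepsilon\pi[\Delta(\theta)]\}\bigr) = \sup_{\beta \in \mathbb{R}_+} f\bigl(\beta, \overline{r}(\theta), -\log\{\varepsilon\pi[\Delta(\theta)]\}\bigr).$$

Applying Corollary 3.2.7 (page 120) to that choice of $\lambda$, we see that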
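Theorem 3.2.8. For any exchangeable probability measure $\mathbb{P} \in \mathcal{M}^1_+(\Omega)$, for any exchangeable posterior probability distribution $\pi : \Omega \to \mathcal{M}^1_+(\Theta)$, with $\mathbb{P}$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$, for any $\lambda \in \mathbb{R}_+$,

$$\overline{r}(\theta) \le r_1(\theta) + A(\lambda)\bigl[\overline{r}(\theta) - r_1(\theta)r_2(\theta)\bigr] - \frac{\log\{\varepsilon\pi[\Delta(\theta)]\}}{\lambda},$$

where $A(\lambda)$ is defined by equation (3.2, page 119).
Solving the previous inequality in $r_2(\theta)$, we get

Corollary 3.2.9. Under the same assumptions as in the previous theorem, with $\mathbb{P}$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$r_2(\theta) \le \inf_{\lambda \in \mathbb{R}_+}\frac{r_1(\theta)\Bigl\{1 + \frac{2N}{\lambda}\log\bigl[\cosh\bigl(\frac{\lambda}{2N}\bigr)\bigr]\Bigr\} - \frac{2\log\{\varepsilon\pi[\Delta(\theta)]\}}{\lambda}}{1 - \frac{2N}{\lambda}\log\bigl[\cosh\bigl(\frac{\lambda}{2N}\bigr)\bigr]\bigl[1 - 2r_1(\theta)\bigr]}.$$

Applying this to our usual numerical example of a binary classification model with Vapnik-Cervonenkis dimension not greater than $h = 10$, when $N = 1000$, $\inf_\Theta r_1 = r_1(\widehat{\theta}) = 0.2$ and $\varepsilon = 0.01$, we obtain that $r_2(\widehat{\theta}) \le 0.4450$.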
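
The figure quoted for the exchangeable case can be checked numerically. This is a sketch under stated assumptions: the bound is taken to be $r_2 \le \inf_\lambda \frac{r_1(1 + A(\lambda)) + 2d/\lambda}{1 - A(\lambda)(1 - 2r_1)}$ with $A(\lambda) = \frac{2N}{\lambda}\log\cosh\frac{\lambda}{2N}$ and $d = h\log(2eN/h) - \log\varepsilon$, the usual example is taken to have $r_1 = 0.2$, and the infimum is approximated on an integer grid. The function name `exchangeable_bound` is illustrative.

```python
import math

def exchangeable_bound(r1, N, h, eps, lambdas):
    """Bound on r2 for k = 1 when the augmented sample is fully exchangeable."""
    d = h * math.log(2 * math.e * N / h) - math.log(eps)
    best = float("inf")
    for lam in lambdas:
        a = (2 * N / lam) * math.log(math.cosh(lam / (2 * N)))  # A(lambda)
        numerator = r1 * (1.0 + a) + 2.0 * d / lam
        denominator = 1.0 - a * (1.0 - 2.0 * r1)  # positive since 0 < a < 1
        best = min(best, numerator / denominator)
    return best

bound_exch = exchangeable_bound(0.2, 1000, 10, 0.01, range(1, 5001))
print(f"r2 bound, exchangeable case: {bound_exch:.4f}")
```

Compared with the partially exchangeable case at $k = 1$, the extra factor $1 - 2r_1$ in the denominator is what brings the bound below one half here.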



3.3. Vapnik bounds for inductive classification 
3.3.1. Arbitrary shadow sample size 



We assume in this section that

$$\mathbb{P} = \Bigl(\bigotimes_{i=1}^{N}P_i\Bigr)^{\otimes\infty} \in \mathcal{M}^1_+\bigl\{[(\mathcal{X} \times \mathcal{Y})^N]^{\mathbb{N}}\bigr\},$$

where $P_i \in \mathcal{M}^1_+(\mathcal{X} \times \mathcal{Y})$: we consider an infinite i.i.d. sequence of independent non-identically distributed samples of size $N$, the first one only being observed. More precisely, under $\mathbb{P}$ each sample $(X_{i+jN}, Y_{i+jN})_{i=1}^{N}$ is distributed according to $\bigotimes_{i=1}^{N}P_i$ and they are all independent from each other. Only the first sample $(X_i, Y_i)_{i=1}^{N}$ is assumed to be observed. The shadow samples will only appear in the proofs. The aim of this section is to prove better Vapnik bounds, generalizing them at the same time to the independent non-i.i.d. setting, which to our knowledge has not been done before.

Let us introduce the notation $\mathbb{P}'[h(\omega)] = \mathbb{P}[h(\omega) \mid (X_i, Y_i)_{i=1}^{N}]$, where $h$ may be any suitable (e.g. bounded) random variable, and let us also put $\Omega = [(\mathcal{X} \times \mathcal{Y})^N]^{\mathbb{N}}$.

Definition 3.3.1. For any subset $A \subset \mathbb{N}$ of integers, let $\mathfrak{C}(A)$ be the set of circular permutations of the totally ordered set $A$, extended to a permutation of $\mathbb{N}$ by taking it to be the identity on the complement $\mathbb{N} \setminus A$ of $A$. We will say that a random function $h : \Omega \to \mathbb{R}$ is $k$-partially exchangeable if

$$h(\omega \circ s) = h(\omega), \qquad s \in \mathfrak{C}(\{i + jN;\ j = 0, \dots, k\}),\ i = 1, \dots, N.$$

In the same way, we will say that a posterior distribution $\pi : \Omega \to \mathcal{M}^1_+(\Theta)$ is $k$-partially exchangeable if

$$\pi(\omega \circ s) = \pi(\omega) \in \mathcal{M}^1_+(\Theta), \qquad s \in \mathfrak{C}(\{i + jN;\ j = 0, \dots, k\}),\ i = 1, \dots, N.$$
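Note that $\mathbb{P}$ itself is $k$-partially exchangeable for any $k$ in the sense that for any bounded measurable function $h : \Omega \to \mathbb{R}$

$$\mathbb{P}[h(\omega \circ s)] = \mathbb{P}[h(\omega)], \qquad s \in \mathfrak{C}(\{i + jN;\ j = 0, \dots, k\}),\ i = 1, \dots, N.$$

Let $\Delta_k(\theta) = \bigl\{\theta' \in \Theta;\ [f_{\theta'}(X_i)]_{i=1}^{(k+1)N} = [f_\theta(X_i)]_{i=1}^{(k+1)N}\bigr\}$, $\theta \in \Theta$, $k \in \mathbb{N}^*$, and let also

$$\overline{r}_k(\theta) = \frac{1}{(k+1)N}\sum_{i=1}^{(k+1)N}\mathbb{1}[f_\theta(X_i) \neq Y_i].$$

Theorem 3.1.2 shows that for any positive real parameter $\lambda$ and any $k$-partially exchangeable posterior distribution $\pi_k : \Omega \to \mathcal{M}^1_+(\Theta)$,

$$\mathbb{P}\Bigl\{\exp\Bigl[\sup_{\theta \in \Theta}\lambda\bigl[\Phi_{\frac{\lambda}{N}}(\overline{r}_k) - r_1\bigr] + \log\{\varepsilon\pi_k[\Delta_k(\theta)]\}\Bigr]\Bigr\} \le \varepsilon.$$

Using the general fact that

$$\mathbb{P}[\exp(h)] = \mathbb{P}\bigl\{\mathbb{P}'[\exp(h)]\bigr\} \ge \mathbb{P}\bigl\{\exp[\mathbb{P}'(h)]\bigr\},$$

and the fact that the expectation of a supremum is larger than the supremum of an expectation, we see that with $\mathbb{P}$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$\mathbb{P}'\bigl\{\Phi_{\frac{\lambda}{N}}[\overline{r}_k(\theta)]\bigr\} \le r_1(\theta) - \frac{\mathbb{P}'\bigl\{\log\{\varepsilon\pi_k[\Delta_k(\theta)]\}\bigr\}}{\lambda}.$$

For short let us put

$$
\begin{aligned}
\overline{d}_k(\theta) &= -\log\{\varepsilon\pi_k[\Delta_k(\theta)]\},\\
d'_k(\theta) &= -\mathbb{P}'\bigl\{\log\{\varepsilon\pi_k[\Delta_k(\theta)]\}\bigr\},\\
d_k(\theta) &= -\mathbb{P}\bigl\{\log\{\varepsilon\pi_k[\Delta_k(\theta)]\}\bigr\}.
\end{aligned}
$$

We can use the convexity of $\Phi_{\frac{\lambda}{N}}$ and the fact that $\mathbb{P}'(\overline{r}_k) = \frac{r_1(\theta) + kR(\theta)}{k+1}$ to establish that

$$\mathbb{P}'\bigl\{\Phi_{\frac{\lambda}{N}}[\overline{r}_k(\theta)]\bigr\} \ge \Phi_{\frac{\lambda}{N}}\Bigl[\frac{r_1(\theta) + kR(\theta)}{k+1}\Bigr].$$

We have proved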



Theorem 3.3.1. Using the above hypotheses and notation, for any sequence $\pi_k : \Omega \to \mathcal{M}^1_+(\Theta)$, where $\pi_k$ is a $k$-partially exchangeable posterior distribution, for any positive real constant $\lambda$, any positive integer $k$, with $\mathbb{P}$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$\Phi_{\frac{\lambda}{N}}\Bigl[\frac{r_1(\theta) + kR(\theta)}{k+1}\Bigr] \le r_1(\theta) + \frac{d'_k(\theta)}{\lambda}.$$

We can make, as we did with Theorem 1.2.6, the result of this theorem uniform in $\lambda \in \{\alpha^j;\ j \in \mathbb{N}^*\}$ and $k \in \mathbb{N}^*$ (considering on $k$ the prior $\frac{1}{k(k+1)}$ and on $j$ the prior $\frac{1}{j(j+1)}$), and obtain
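Theorem 3.3.2. For any real parameter $\alpha > 1$, with $\mathbb{P}$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$R(\theta) \le \inf_{k \in \mathbb{N}^*,\ j \in \mathbb{N}^*}\frac{1 - \exp\Bigl\{-\frac{\alpha^j}{N}r_1(\theta) - \frac{1}{N}\bigl\{d'_k(\theta) + \log[k(k+1)j(j+1)]\bigr\}\Bigr\}}{\frac{k}{k+1}\Bigl[1 - \exp\bigl(-\frac{\alpha^j}{N}\bigr)\Bigr]} - \frac{r_1(\theta)}{k}.$$

As a special case we can choose $\pi_k$ such that $\log\{\pi_k[\Delta_k(\theta)]\}$ is independent of $\theta$ and equal to $-\log(\mathfrak{N}_k)$, where

$$\mathfrak{N}_k = \Bigl|\bigl\{[f_\theta(X_i)]_{i=1}^{(k+1)N};\ \theta \in \Theta\bigr\}\Bigr|$$

is the size of the trace of the classification model on the extended sample of size $(k+1)N$. With this choice, we obtain a bound involving a new flavour of conditional Vapnik entropy, namely

$$d'_k(\theta) = \mathbb{P}\bigl[\log(\mathfrak{N}_k) \mid (Z_i)_{i=1}^{N}\bigr] - \log(\varepsilon).$$

In the case of binary classification using a Vapnik-Cervonenkis class of Vapnik-Cervonenkis dimension not greater than $h = 10$, when $N = 1000$, $\inf_\Theta r_1 = r_1(\widehat{\theta}) = 0.2$ and $\varepsilon = 0.01$, choosing $\alpha = 1.1$, we obtain $R(\widehat{\theta}) \le 0.4271$ (for an optimal value of $\lambda = 1071.8$, and an optimal value of $k = 16$).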



3.3.2. A better minimization with respect to the exponential parameter 

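If we are not pleased with optimizing $\lambda$ on a discrete subset of the real line, we can use a slightly different approach. From Theorem 3.1.2 (page 113), we see that for any positive integer $k$, for any $k$-partially exchangeable positive real measurable function $\lambda : \Omega \times \Theta \to \mathbb{R}_+$ satisfying equation (3.1, page 116), with $\Delta(\theta)$ replaced with $\Delta_k(\theta)$, for any $\varepsilon \in (0, 1)$ and $\eta \in (0, 1)$,

$$\mathbb{P}\Bigl\{\exp\Bigl[\sup_{\theta \in \Theta}\lambda\bigl[\Phi_{\frac{\lambda}{N}}(\overline{r}_k) - r_1\bigr] + \log\{\varepsilon\eta\pi_k[\Delta_k(\theta)]\}\Bigr]\Bigr\} \le \varepsilon\eta,$$

therefore with $\mathbb{P}$ probability at least $1 - \varepsilon$,

$$\mathbb{P}'\Bigl\{\exp\Bigl[\sup_{\theta \in \Theta}\lambda\bigl[\Phi_{\frac{\lambda}{N}}(\overline{r}_k) - r_1\bigr] + \log\{\varepsilon\eta\pi_k[\Delta_k(\theta)]\}\Bigr]\Bigr\} \le \eta,$$

and consequently, with $\mathbb{P}$ probability at least $1 - \varepsilon$, with $\mathbb{P}'$ probability at least $1 - \eta$, for any $\theta \in \Theta$,

$$\Phi_{\frac{\lambda}{N}}(\overline{r}_k) + \frac{\log\{\varepsilon\eta\pi_k[\Delta_k(\theta)]\}}{\lambda} \le r_1.$$

Now we are entitled to choose

$$\lambda(\omega, \theta) \in \arg\max_{\lambda' \in \mathbb{R}_+}\Bigl\{\Phi_{\frac{\lambda'}{N}}(\overline{r}_k) + \frac{\log\{\varepsilon\eta\pi_k[\Delta_k(\theta)]\}}{\lambda'}\Bigr\}.$$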
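This shows that with $\mathbb{P}$ probability at least $1 - \varepsilon$, with $\mathbb{P}'$ probability at least $1 - \eta$, for any $\theta \in \Theta$,

$$\sup_{\lambda \in \mathbb{R}_+}\Bigl\{\Phi_{\frac{\lambda}{N}}(\overline{r}_k) - \frac{\overline{d}_k(\theta) - \log(\eta)}{\lambda}\Bigr\} \le r_1,$$

which can also be written

$$\Phi_{\frac{\lambda}{N}}(\overline{r}_k) - r_1 - \frac{\overline{d}_k(\theta)}{\lambda} \le -\frac{\log(\eta)}{\lambda}, \qquad \lambda \in \mathbb{R}_+.$$

Thus with $\mathbb{P}$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$, any $\lambda \in \mathbb{R}_+$,

$$\mathbb{P}'\Bigl[\Phi_{\frac{\lambda}{N}}(\overline{r}_k) - r_1 - \frac{\overline{d}_k}{\lambda}\Bigr] \le -\frac{\log(\eta)}{\lambda} + \Bigl[1 - r_1 + \frac{\log(\eta)}{\lambda}\Bigr]\eta.$$

On the other hand, $\Phi_{\frac{\lambda}{N}}$ being a convex function,

$$\mathbb{P}'\Bigl[\Phi_{\frac{\lambda}{N}}(\overline{r}_k) - r_1 - \frac{\overline{d}_k}{\lambda}\Bigr] \ge \Phi_{\frac{\lambda}{N}}\bigl[\mathbb{P}'(\overline{r}_k)\bigr] - r_1 - \frac{d'_k}{\lambda} = \Phi_{\frac{\lambda}{N}}\Bigl(\frac{kR + r_1}{k+1}\Bigr) - r_1 - \frac{d'_k}{\lambda}.$$

Thus with $\mathbb{P}$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$\frac{kR + r_1}{k+1} \le \inf_{\lambda \in \mathbb{R}_+}\Phi^{-1}_{\frac{\lambda}{N}}\Bigl[r_1(1 - \eta) + \eta + \frac{d'_k - \log(\eta)(1 - \eta)}{\lambda}\Bigr].$$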

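We can generalize this approach by considering a finite decreasing sequence $\eta_0 = 1 > \eta_1 > \eta_2 > \cdots > \eta_J > \eta_{J+1} = 0$, and the corresponding sequence of levels

$$L_j = -\frac{\log(\eta_j)}{\lambda}, \quad 0 \le j \le J, \qquad L_{J+1} = 1 - r_1 - \frac{\log(J) - \log(\varepsilon)}{\lambda}.$$

Taking a union bound in $j$, we see that with $\mathbb{P}$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$, for any $\lambda \in \mathbb{R}_+$,

$$\mathbb{P}'\Bigl[\Phi_{\frac{\lambda}{N}}(\overline{r}_k) - r_1 - \frac{\overline{d}_k + \log(J)}{\lambda} \ge L_j\Bigr] \le \eta_j, \qquad j = 0, \dots, J+1,$$

and consequently

$$
\begin{aligned}
\mathbb{P}'\Bigl[\Phi_{\frac{\lambda}{N}}(\overline{r}_k) - r_1 - \frac{\overline{d}_k + \log(J)}{\lambda}\Bigr]
&\le \int_0^{L_{J+1}}\mathbb{P}'\Bigl[\Phi_{\frac{\lambda}{N}}(\overline{r}_k) - r_1 - \frac{\overline{d}_k + \log(J)}{\lambda} \ge \alpha\Bigr]\,d\alpha
\le \sum_{j=1}^{J+1}\eta_{j-1}(L_j - L_{j-1})\\
&= \eta_J\Bigl[1 - r_1 - \frac{\log(J) - \log(\varepsilon) - \log(\eta_J)}{\lambda}\Bigr] - \frac{\log(\eta_1)}{\lambda} + \sum_{j=1}^{J-1}\frac{\eta_j}{\lambda}\log\Bigl(\frac{\eta_j}{\eta_{j+1}}\Bigr).
\end{aligned}
$$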
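Let us put

$$d''_k\bigl[\theta, (\eta_j)_{j=1}^{J}\bigr] = d'_k(\theta) + \log(J) - \log(\eta_1) + \sum_{j=1}^{J-1}\eta_j\log\Bigl(\frac{\eta_j}{\eta_{j+1}}\Bigr) + \eta_J\log\Bigl(\frac{\varepsilon\eta_J}{J}\Bigr).$$

We have proved that for any decreasing sequence $(\eta_j)_{j=1}^{J}$, with $\mathbb{P}$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$\Phi_{\frac{\lambda}{N}}\Bigl(\frac{kR + r_1}{k+1}\Bigr) \le r_1(1 - \eta_J) + \eta_J + \frac{d''_k\bigl[\theta, (\eta_j)_{j=1}^{J}\bigr]}{\lambda}.$$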



Remark 3.3.1. We can for instance choose $J = 2$, $\eta_2 = \frac{1}{10N}$, $\eta_1 = \frac{1}{\log(10N)}$, resulting in

$$d''_k = d'_k + \log(2) + \log\log(10N) + 1 - \frac{\log\log(10N)}{\log(10N)} - \frac{\log\bigl(\frac{20N}{\varepsilon}\bigr)}{10N}.$$

In the case where $N = 1000$ and for any $\varepsilon \in (0, 1)$, we get $d''_k \le d'_k + 3.7$, in the case where $N = 10^6$, we get $d''_k \le d'_k + 4.4$, and in the case $N = 10^9$, we get $d''_k \le d'_k + 4.7$.

Therefore, for any practical purpose we could take $d''_k = d'_k + 4.7$ and $\eta_J = \frac{1}{10N}$ in the above inequality.
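
The constants quoted in this remark are easy to check. The sketch below assumes the two-level choice $J = 2$, $\eta_1 = 1/\log(10N)$, $\eta_2 = 1/(10N)$ and evaluates the additive term $\log J - \log\eta_1 + \eta_1\log(\eta_1/\eta_2) + \eta_2\log(\varepsilon\eta_2/J)$, that is, the difference $d''_k - d'_k$. The function name `extra_constant` is an illustrative choice.

```python
import math

def extra_constant(N, eps):
    """Additive term d''_k - d'_k for J = 2, eta_1 = 1/log(10N), eta_2 = 1/(10N)."""
    J = 2
    eta1 = 1.0 / math.log(10 * N)
    eta2 = 1.0 / (10 * N)
    return (math.log(J) - math.log(eta1)
            + eta1 * math.log(eta1 / eta2)
            + eta2 * math.log(eps * eta2 / J))

for N in (10**3, 10**6, 10**9):
    print(f"N = {N:>10}: d'' - d' = {extra_constant(N, 0.01):.4f}")
```

The last term is negative for every $\varepsilon \in (0, 1)$, which is why the stated constants do not depend on the confidence level.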

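Taking moreover a weighted union bound in $k$, we get

Theorem 3.3.3. For any $\varepsilon \in (0, 1)$, any sequence $1 > \eta_1 > \cdots > \eta_J > 0$, any sequence $\pi_k : \Omega \to \mathcal{M}^1_+(\Theta)$, where $\pi_k$ is a $k$-partially exchangeable posterior distribution, with $\mathbb{P}$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$R(\theta) \le \inf_{k \in \mathbb{N}^*}\frac{k+1}{k}\inf_{\lambda \in \mathbb{R}_+}\Phi^{-1}_{\frac{\lambda}{N}}\Bigl[r_1(\theta) + \eta_J[1 - r_1(\theta)] + \frac{d''_k\bigl[\theta, (\eta_j)_{j=1}^{J}\bigr] + \log[k(k+1)]}{\lambda}\Bigr] - \frac{r_1(\theta)}{k}.$$

Corollary 3.3.4. For any $\varepsilon \in (0, 1)$, for any $N \le 10^9$, with $\mathbb{P}$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$R(\theta) \le \inf_{k \in \mathbb{N}^*}\inf_{\lambda \in \mathbb{R}_+}\frac{k+1}{k}\,\frac{1 - \exp\Bigl\{-\frac{\lambda}{N}\bigl[r_1(\theta) + \frac{1}{10N}\bigr] - \frac{1}{N}\Bigl(\mathbb{P}\bigl[\log(\mathfrak{N}_k) \mid (Z_i)_{i=1}^{N}\bigr] - \log(\varepsilon) + \log[k(k+1)] + 4.7\Bigr)\Bigr\}}{1 - \exp\bigl(-\frac{\lambda}{N}\bigr)} - \frac{r_1(\theta)}{k}.$$

Let us end this section with a numerical example: in the case of binary classification with a Vapnik-Cervonenkis class of dimension not greater than 10, when $N = 1000$, $\inf_\Theta r_1 = r_1(\widehat{\theta}) = 0.2$ and $\varepsilon = 0.01$, we get a bound $R(\widehat{\theta}) \le 0.4211$ (for optimal values of $k = 15$ and of $\lambda = 1010$).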
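
This numerical example can be approximated as follows. The sketch rests on several assumptions that are not spelled out at this point of the text: the entropy term is bounded by $h\log\bigl(e(k+1)N/h\bigr)$, the additive constant is taken to be 3.7 (the value stated for $N = 1000$) rather than the uniform 4.7, $\eta_J = 1/(10N)$, and the infimum over $\lambda$ is taken on an integer grid. The name `inductive_bound` is hypothetical.

```python
import math

def inductive_bound(r1, N, h, eps, k, lambdas, extra=3.7):
    """Bound on the expected error rate R using a shadow sample of size k*N in the proof."""
    eta = 1.0 / (10 * N)
    # d''_k + log[k(k+1)], with the VC bound on the size of the trace of the model.
    d = (h * math.log(math.e * (k + 1) * N / h) - math.log(eps)
         + extra + math.log(k * (k + 1)))
    q = r1 + eta * (1.0 - r1)
    best = min(
        (1.0 - math.exp(-(lam / N) * q - d / N)) / (1.0 - math.exp(-lam / N))
        for lam in lambdas
    )
    return (k + 1) / k * best - r1 / k

LAMBDAS = range(1, 5001)  # integer grid standing in for lambda in R_+ (an assumption)
values = {k: inductive_bound(0.2, 1000, 10, 0.01, k, LAMBDAS) for k in range(1, 31)}
best_k = min(values, key=values.get)
print(f"best k = {best_k}, bound on R = {values[best_k]:.4f}")
```

With the uniform constant 4.7 in place of 3.7 the same script returns a slightly larger value, so the choice of constant matters at the third decimal place.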



3.3.3. Equal shadow and training sample sizes 

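In the case when $k = 1$, we can use Theorem 3.2.5 (page 118) and replace $\Phi^{-1}_{\frac{\lambda}{N}}(q)$ with $\bigl\{1 - \frac{2N}{\lambda}\log\bigl[\cosh\bigl(\frac{\lambda}{2N}\bigr)\bigr]\bigr\}^{-1}q$, resulting in

Theorem 3.3.5. For any $\varepsilon \in (0, 1)$, any $N \le 10^9$, any one-partially exchangeable posterior distribution $\pi_1 : \Omega \to \mathcal{M}^1_+(\Theta)$, with $\mathbb{P}$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$R(\theta) \le \inf_{\lambda \in \mathbb{R}_+}\frac{\Bigl\{1 + \frac{2N}{\lambda}\log\bigl[\cosh\bigl(\frac{\lambda}{2N}\bigr)\bigr]\Bigr\}r_1(\theta) + \frac{1}{5N} + 2\,\frac{d'_1(\theta) + 4.7}{\lambda}}{1 - \frac{2N}{\lambda}\log\bigl[\cosh\bigl(\frac{\lambda}{2N}\bigr)\bigr]}.$$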

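3.3.4. Improvement on the equal sample size bound in the i.i.d. case

Finally, in the case when $\mathbb{P}$ is i.i.d., meaning that all the $P_i$ are equal, we can improve the previous bound. For any partially exchangeable function $\lambda : \Omega \times \Theta \to \mathbb{R}_+$, we saw in the discussion preceding Theorem 3.2.6 (page 120) that

$$T\bigl\{\exp\bigl[\lambda\bigl(\overline{r}_k - r_1 - A(\lambda)v\bigr)\bigr]\bigr\} \le 1,$$

with the notation introduced therein. Thus for any partially exchangeable positive real measurable function $\lambda : \Omega \times \Theta \to \mathbb{R}_+$ satisfying equation (3.1, page 116), any one-partially exchangeable posterior distribution $\pi_1 : \Omega \to \mathcal{M}^1_+(\Theta)$,

$$\mathbb{P}\Bigl\{\exp\Bigl[\sup_{\theta \in \Theta}\lambda\bigl[\overline{r}_k(\theta) - r_1(\theta) - A(\lambda)v(\theta)\bigr] + \log\bigl[\varepsilon\pi_1[\Delta(\theta)]\bigr]\Bigr]\Bigr\} \le 1.$$

Therefore with $\mathbb{P}$ probability at least $1 - \varepsilon$, with $\mathbb{P}'$ probability $1 - \eta$,

$$\overline{r}_k(\theta) \le r_1(\theta) + A(\lambda)v(\theta) + \frac{1}{\lambda}\bigl[\overline{d}_1(\theta) - \log(\eta)\bigr].$$

We can then choose $\lambda(\omega, \theta) \in \arg\min_{\lambda' \in \mathbb{R}_+}\Bigl\{A(\lambda')v(\theta) + \frac{\overline{d}_1(\theta) - \log(\eta)}{\lambda'}\Bigr\}$, which satisfies the required conditions, to show that with $\mathbb{P}$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$, with $\mathbb{P}'$ probability at least $1 - \eta$, for any $\lambda \in \mathbb{R}_+$,

$$\overline{r}_k(\theta) \le r_1(\theta) + A(\lambda)v(\theta) + \frac{\overline{d}_1(\theta) - \log(\eta)}{\lambda}.$$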

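We can then take a union bound on a decreasing sequence of $J$ values $\eta_1 > \cdots > \eta_J$ of $\eta$. Weakening the order of quantifiers a little, we then obtain the following statement: with $\mathbb{P}$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$, for any $\lambda \in \mathbb{R}_+$, for any $j = 1, \dots, J$,

$$\mathbb{P}'\Bigl[\overline{r}_k(\theta) - r_1(\theta) - A(\lambda)v(\theta) - \frac{\overline{d}_1(\theta) + \log(J)}{\lambda} \ge -\frac{\log(\eta_j)}{\lambda}\Bigr] \le \eta_j.$$

Consequently for any $\lambda \in \mathbb{R}_+$,

$$
\begin{aligned}
\mathbb{P}'\Bigl[\overline{r}_k(\theta) - r_1(\theta) - A(\lambda)v(\theta) - \frac{\overline{d}_1(\theta) + \log(J)}{\lambda}\Bigr]
\le{}& -\frac{\log(\eta_1)}{\lambda} + \eta_J\Bigl[1 - r_1(\theta) - \frac{\log(J) - \log(\varepsilon) - \log(\eta_J)}{\lambda}\Bigr]\\
&+ \sum_{j=1}^{J-1}\frac{\eta_j}{\lambda}\log\Bigl(\frac{\eta_j}{\eta_{j+1}}\Bigr).
\end{aligned}
$$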
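Moreover $\mathbb{P}'[v(\theta)] = \frac{r_1 + R}{2} - r_1R$ (this is where we need equidistribution), thus proving that

$$\frac{R - r_1}{2} \le \frac{A(\lambda)}{2}\bigl[R + r_1 - 2r_1R\bigr] + \frac{d''_1\bigl[\theta, (\eta_j)_{j=1}^{J}\bigr]}{\lambda} + \eta_J\bigl[1 - r_1(\theta)\bigr].$$

Keeping track of quantifiers, we obtain

Theorem 3.3.6. For any decreasing sequence $(\eta_j)_{j=1}^{J}$, any $\varepsilon \in (0, 1)$, any one-partially exchangeable posterior distribution $\pi_1 : \Omega \to \mathcal{M}^1_+(\Theta)$, with $\mathbb{P}$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$R(\theta) \le \inf_{\lambda \in \mathbb{R}_+}\frac{\Bigl\{1 + \frac{2N}{\lambda}\log\bigl[\cosh\bigl(\frac{\lambda}{2N}\bigr)\bigr]\Bigr\}r_1(\theta) + \frac{2\,d''_1\bigl[\theta, (\eta_j)_{j=1}^{J}\bigr]}{\lambda} + 2\eta_J\bigl[1 - r_1(\theta)\bigr]}{1 - \frac{2N}{\lambda}\log\bigl[\cosh\bigl(\frac{\lambda}{2N}\bigr)\bigr]\bigl[1 - 2r_1(\theta)\bigr]}.$$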

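3.4. Gaussian approximation in Vapnik bounds
3.4.1. Gaussian upper bounds of variance terms

To obtain formulas which could be easily compared with original Vapnik bounds, we may replace $p - \Phi_a(p)$ with a Gaussian upper bound:

Lemma 3.4.1. For any $p \in (0, \frac{1}{2})$, any $a \in \mathbb{R}_+$,

$$p - \Phi_a(p) \le \frac{a}{2}\,p(1 - p).$$

For any $p \in (\frac{1}{2}, 1)$,

$$p - \Phi_a(p) \le \frac{a}{8}.$$

Proof. Let us notice that for any $p \in (0, 1)$,

$$
\begin{aligned}
\frac{\partial}{\partial a}\bigl[-a\Phi_a(p)\bigr] &= -\frac{p\exp(-a)}{1 - p + p\exp(-a)},\\
\frac{\partial^2}{\partial a^2}\bigl[-a\Phi_a(p)\bigr] &= \frac{p\exp(-a)}{1 - p + p\exp(-a)}\Bigl(1 - \frac{p\exp(-a)}{1 - p + p\exp(-a)}\Bigr)
\le \begin{cases} p(1 - p), & p \in (0, \frac{1}{2}),\\ \frac{1}{4}, & p \in (\frac{1}{2}, 1).\end{cases}
\end{aligned}
$$

Thus taking a Taylor expansion of order one with integral remainder:

$$
-a\Phi_a(p) \le \begin{cases}
-ap + \displaystyle\int_0^a p(1 - p)(a - b)\,db = -ap + \frac{a^2}{2}\,p(1 - p), & p \in (0, \frac{1}{2}),\\[2ex]
-ap + \displaystyle\int_0^a \frac{1}{4}(a - b)\,db = -ap + \frac{a^2}{8}, & p \in (\frac{1}{2}, 1).
\end{cases}
$$

This ends the proof of our lemma. □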
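
The two inequalities of the lemma can be sanity-checked numerically. The sketch below assumes the definition $\Phi_a(p) = -a^{-1}\log\{1 - [1 - \exp(-a)]p\}$, which is consistent with the derivative computed in the proof, and scans a grid of values of $p$ and $a$; the function names and the grid are illustrative.

```python
import math

def phi(a, p):
    """Phi_a(p) = -(1/a) log(1 - (1 - exp(-a)) p), for a > 0 and 0 <= p < 1."""
    return -math.log(1.0 - (1.0 - math.exp(-a)) * p) / a

def worst_violation():
    """Largest value of (p - Phi_a(p)) minus its Gaussian upper bound over a grid."""
    worst = -float("inf")
    for i in range(1, 100):
        p = i / 100.0
        for j in range(1, 201):
            a = 0.05 * j
            bound = 0.5 * a * p * (1.0 - p) if p <= 0.5 else a / 8.0
            worst = max(worst, (p - phi(a, p)) - bound)
    return worst

print(f"largest violation on the grid: {worst_violation():.3e}")
```

A non-positive result means that no grid point violates the lemma; this is of course only a numerical illustration, not a substitute for the proof.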
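Lemma 3.4.2. Let us consider the bound

$$B(q, d) = \Bigl(1 + \frac{2d}{N}\Bigr)^{-1}\Bigl(q + \frac{d}{N} + \sqrt{\frac{2dq(1 - q)}{N} + \frac{d^2}{N^2}}\Bigr), \qquad q \in \mathbb{R}_+,\ d \in \mathbb{R}_+.$$

Let us also put

$$\overline{B}(q, d) = \begin{cases} B(q, d) & \text{if } B(q, d) \le \frac{1}{2},\\[1ex] q + \sqrt{\dfrac{d}{2N}} & \text{otherwise}.\end{cases}$$

For any positive real parameters $q$ and $d$

$$\inf_{\lambda \in \mathbb{R}_+}\Phi^{-1}_{\frac{\lambda}{N}}\Bigl(q + \frac{d}{\lambda}\Bigr) \le \overline{B}(q, d).$$

Proof. Let $p = \inf_{\lambda \in \mathbb{R}_+}\Phi^{-1}_{\frac{\lambda}{N}}\bigl(q + \frac{d}{\lambda}\bigr)$. For any $\lambda \in \mathbb{R}_+$,

$$p - \frac{\lambda}{2N}\bigl(p \wedge \tfrac{1}{2}\bigr)\bigl[1 - \bigl(p \wedge \tfrac{1}{2}\bigr)\bigr] \le \Phi_{\frac{\lambda}{N}}(p) \le q + \frac{d}{\lambda}.$$

Thus

$$p \le q + \inf_{\lambda \in \mathbb{R}_+}\Bigl\{\frac{\lambda}{2N}\bigl(p \wedge \tfrac{1}{2}\bigr)\bigl[1 - \bigl(p \wedge \tfrac{1}{2}\bigr)\bigr] + \frac{d}{\lambda}\Bigr\} = q + \sqrt{\frac{2d\bigl(p \wedge \frac{1}{2}\bigr)\bigl[1 - \bigl(p \wedge \frac{1}{2}\bigr)\bigr]}{N}} \le q + \sqrt{\frac{d}{2N}}.$$

Then let us remark that $B(q, d) = \sup\Bigl\{p' \in \mathbb{R}_+;\ p' \le q + \sqrt{\frac{2dp'(1 - p')}{N}}\Bigr\}$. If moreover $\frac{1}{2} \ge B(q, d)$, then according to this remark $\frac{1}{2} \ge q + \sqrt{\frac{d}{2N}} \ge p$. Therefore $p \le \frac{1}{2}$, and consequently $p \le q + \sqrt{\frac{2dp(1 - p)}{N}}$, implying that $p \le B(q, d)$. □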
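
To see how tight the Gaussian bound of this lemma is, one can compare it with a grid approximation of the exact infimum. The sketch below assumes $\Phi^{-1}_a(q) = \frac{1 - \exp(-aq)}{1 - \exp(-a)}$ and the closed form $B(q, d) = \bigl(1 + \frac{2d}{N}\bigr)^{-1}\bigl(q + \frac{d}{N} + \sqrt{\frac{2dq(1-q)}{N} + \frac{d^2}{N^2}}\bigr)$, switching to $q + \sqrt{d/(2N)}$ when $B(q, d) > \frac{1}{2}$. The example values ($N = 1000$, $h = 10$, $\varepsilon = 0.01$, $q = 0.2$, $d = h\log(2eN/h) - \log\varepsilon$) and all function names are choices made for this illustration.

```python
import math

def phi_inverse(a, q):
    """Inverse of Phi_a: (1 - exp(-a q)) / (1 - exp(-a))."""
    return (1.0 - math.exp(-a * q)) / (1.0 - math.exp(-a))

def exact_infimum(q, d, N, lambdas):
    """Grid approximation of inf over lambda of Phi^{-1}_{lambda/N}(q + d/lambda)."""
    return min(phi_inverse(lam / N, q + d / lam) for lam in lambdas)

def gaussian_bound(q, d, N):
    """Gaussian upper bound: B(q, d) if it is at most 1/2, else q + sqrt(d / (2N))."""
    b = (q + d / N + math.sqrt(2 * d * q * (1 - q) / N + (d / N) ** 2)) / (1 + 2 * d / N)
    return b if b <= 0.5 else q + math.sqrt(d / (2 * N))

N, h, eps, q = 1000, 10, 0.01, 0.2
d = h * math.log(2 * math.e * N / h) - math.log(eps)
exact = exact_infimum(q, d, N, range(1, 5001))
approx = gaussian_bound(q, d, N)
print(f"grid infimum: {exact:.4f}, Gaussian bound: {approx:.4f}")
```

In this example the Gaussian bound exceeds the grid infimum by less than 0.01, which gives an idea of the price paid for the closed form.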



3.4.2. Arbitrary shadow sample size

The previous lemma combined with Corollary 3.3.4 (page 125) implies



Corollary 3.4.3. Let us use the notation introduced in Lemma 3.4.2. For any $\varepsilon \in (0, 1)$, any integer $N \le 10^9$, with $\mathbb{P}$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$R(\theta) \le \inf_{k \in \mathbb{N}^*}\Bigl\{\frac{k+1}{k}\,\overline{B}\Bigl[r_1(\theta) + \frac{1}{10N},\ d'_k(\theta) + \log[k(k+1)] + 4.7\Bigr] - \frac{r_1(\theta)}{k}\Bigr\}.$$



3.4.3. Equal sample sizes in the i.i.d. case

To make a link with Vapnik's result, it is useful to state the Gaussian approximation to Theorem 3.3.6 (page 127). Indeed, using the upper bound $A(\lambda) \le \frac{\lambda}{4N}$, where $A(\lambda)$ is defined by equation (3.2) on page 119, we get with $\mathbb{P}$ probability at least $1 - \varepsilon$

$$R - r_1 - 2\eta_J \le \sqrt{\frac{2d''_1(R + r_1 - 2r_1R)}{N}},$$

which can be solved in $R$ to obtain

Corollary 3.4.4. With $\mathbb{P}$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$R(\theta) \le r_1(\theta) + \frac{d''_1(\theta)}{N}\bigl[1 - 2r_1(\theta)\bigr] + 2\eta_J + \sqrt{\frac{4d''_1(\theta)\bigl[1 - r_1(\theta)\bigr]r_1(\theta)}{N} + \frac{d''_1(\theta)^2\bigl[1 - 2r_1(\theta)\bigr]^2}{N^2} + \frac{4d''_1(\theta)\bigl[1 - 2r_1(\theta)\bigr]\eta_J}{N}}.$$

This is to be compared with Vapnik's result, as proved in Vapnik (1998, page 138):

Theorem 3.4.5 (Vapnik). For any i.i.d. probability distribution $\mathbb{P}$, with $\mathbb{P}$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$, putting

$$d_V = \log\bigl[\mathbb{P}(\mathfrak{N}_1)\bigr] + \log(4/\varepsilon),$$

$$R(\theta) \le r_1(\theta) + \frac{2d_V}{N} + \sqrt{\frac{4d_Vr_1(\theta)}{N} + \frac{4d_V^2}{N^2}}.$$

Recalling that we can choose $(\eta_j)_{j=1}^{2}$ such that $\eta_J = \eta_2 = \frac{1}{10N}$ (which brings a negligible contribution to the bound) and such that for any $N \le 10^9$,

$$d''_1(\theta) \le \mathbb{P}\bigl[\log(\mathfrak{N}_1) \mid (Z_i)_{i=1}^{N}\bigr] - \log(\varepsilon) + 4.7,$$

we see that our complexity term is somehow more satisfactory than Vapnik's, since it is integrated outside the logarithm, with a slightly larger additional constant (remember that $\log 4 \simeq 1.4$, which is better than our 4.7, which could presumably be improved by working out a better sequence $\eta_j$, but not down to $\log(4)$). Our variance term is better, since we get $r_1(1 - r_1)$ instead of $r_1$. We also have $\frac{d''_1}{N}$ instead of $\frac{2d_V}{N}$, because we use no symmetrization trick.

Let us illustrate these bounds on a numerical example, corresponding to a situation where the sample is noisy or the classification model is weak. Let us assume that $N = 1000$, $\inf_\Theta r_1 = r_1(\widehat{\theta}) = 0.2$, that we are performing binary classification with a model with Vapnik-Cervonenkis dimension not greater than $h = 10$, and that we work at confidence level $\varepsilon = 0.01$. Vapnik's theorem provides an upper bound for $R(\widehat{\theta})$ not smaller than 0.610, whereas Corollary 3.4.4 gives $R(\widehat{\theta}) \le 0.461$ (using the bound $d''_1 \le d'_1 + 3.7$ when $N = 1000$). Now if we go for Theorem 3.3.6 and do not make a Gaussian approximation, we get $R(\widehat{\theta}) \le 0.453$. It is interesting to remark that this bound is achieved for $\lambda = 1195 > N = 1000$. This explains why the Gaussian approximation in Vapnik's bound can be improved: for such a large value of $\lambda$, $\lambda r_1(\theta)$ does not behave like a Gaussian random variable.
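
The three figures of this comparison can be recomputed with the sketch below. It is written under explicit assumptions: the log-size of the trace of the model is bounded by $h\log(2eN/h)$, the complexity term of the present approach is that quantity minus $\log\varepsilon$ plus 3.7, $\eta_J = 1/(10N)$, Vapnik's bound is read as $r_1 + \frac{2d_V}{N} + \sqrt{\frac{4d_Vr_1}{N} + \frac{4d_V^2}{N^2}}$, and the Gaussian bound is obtained by solving $R - r_1 - 2\eta_J \le \sqrt{2d(R + r_1 - 2r_1R)/N}$ in $R$. Function names are illustrative.

```python
import math

N, h, eps, r1 = 1000, 10, 0.01, 0.2
log_trace = h * math.log(2 * math.e * N / h)  # VC bound on the log-size of the trace
eta = 1.0 / (10 * N)
d2 = log_trace - math.log(eps) + 3.7          # assumed complexity term for N = 1000

def vapnik_bound():
    d_v = log_trace + math.log(4.0 / eps)
    return r1 + 2 * d_v / N + math.sqrt(4 * d_v * r1 / N + 4 * d_v ** 2 / N ** 2)

def gaussian_bound():
    # Largest R with R - r1 - 2 eta <= sqrt(2 d (R + r1 - 2 r1 R) / N).
    lin = d2 / N * (1 - 2 * r1)
    root = math.sqrt(4 * d2 * r1 * (1 - r1) / N + lin ** 2 + 4 * d2 * (1 - 2 * r1) * eta / N)
    return r1 + lin + 2 * eta + root

def non_gaussian_bound(lambdas):
    best = float("inf")
    for lam in lambdas:
        a = (2 * N / lam) * math.log(math.cosh(lam / (2 * N)))
        numerator = (1 + a) * r1 + 2 * d2 / lam + 2 * eta * (1 - r1)
        best = min(best, numerator / (1 - a * (1 - 2 * r1)))
    return best

v, g, ng = vapnik_bound(), gaussian_bound(), non_gaussian_bound(range(1, 5001))
print(f"Vapnik: {v:.3f}  Gaussian: {g:.3f}  non-Gaussian: {ng:.3f}")
```

The ordering of the three values reflects the two sources of improvement discussed in the text: the absence of a symmetrization factor and the use of the exact Bernoulli log-Laplace transform.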

Let us recall in conclusion that the best bound is provided by Theorem 3.3.3 (page 125), giving $R(\widehat{\theta}) \le 0.4211$ (that is approximately 2/3 of Vapnik's bound), for optimal values of $k = 15$ and of $\lambda = 1010$. This bound can be seen to take advantage of the fact that Bernoulli random variables are not Gaussian (its Gaussian approximation, Corollary 3.4.3, gives a bound $R(\widehat{\theta}) \le 0.4325$, still with an optimal $k = 15$), and of the fact that the optimal size of the shadow sample is significantly larger than the size of the observed sample. Moreover, Theorem 3.3.3 does not assume that the sample is i.i.d., but only that it is independent, thus generalizing Vapnik's bounds to inhomogeneous data (this will presumably be the case when data are collected from different places where the experimental conditions may not be the same, although they may reasonably be assumed to be independent).

Our little numerical example was chosen to illustrate the case when it is non- 
trivial to decide whether the chosen classifier does better than the 0.5 error rate 
of blind random classification. This case is of interest to choose "weak learners" 
to be aggregated or combined in some appropriate way in a second stage to reach 
a better classification rate. This stage of feature selection is unavoidable in many 
real world classification tasks. Our little computations are meant to exemplify the 
fact that Vapnik's bounds, although asymptotically suboptimal, as is obvious by 
comparison with the first two chapters, can do the job when dealing with moderate 
sample sizes. 



Chapter 4 

Support Vector Machines 



4.1. How to build them 

4.1.1. The canonical hyperplane

Support Vector Machines, of wide use and renown, were conceived by V. Vapnik
(Vapnik, 1998). Before introducing them, we will study as a prerequisite the
separation of points by hyperplanes in a finite dimensional Euclidean space. Support
Vector Machines perform the same kind of linear separation after an implicit change
of pattern space. The preceding PAC-Bayesian results provide a fit framework to
analyse their generalization properties.

In this section we deal with the classification of points in $\mathbb{R}^d$ in two classes. Let
$Z = (x_i, y_i)_{i=1}^N \in (\mathbb{R}^d \times \{-1, +1\})^N$ be some set of labelled examples (called the
training set hereafter). Let us split the set of indices $I = \{1, \dots, N\}$ according to
the labels into two subsets

\[ I_+ = \{ i \in I : y_i = +1 \}, \qquad I_- = \{ i \in I : y_i = -1 \}. \]

Let us then consider the set of admissible separating directions

\[ A_Z = \Bigl\{ w \in \mathbb{R}^d : \sup_{b \in \mathbb{R}} \inf_{i \in I} \bigl( \langle w, x_i \rangle - b \bigr) y_i \ge 1 \Bigr\}, \]

which can also be written as

\[ A_Z = \Bigl\{ w \in \mathbb{R}^d : \max_{i \in I_-} \langle w, x_i \rangle + 2 \le \min_{i \in I_+} \langle w, x_i \rangle \Bigr\}. \]

As is easily seen, the optimal value of $b$ for a fixed value of $w$, in other words the
value of $b$ which maximizes $\inf_{i \in I} (\langle w, x_i \rangle - b) y_i$, is equal to

\[ b_w = \frac{1}{2} \Bigl[ \max_{i \in I_-} \langle w, x_i \rangle + \min_{i \in I_+} \langle w, x_i \rangle \Bigr]. \]



Lemma 4.1.1. When $A_Z \neq \emptyset$, $\inf\{ \|w\|^2 : w \in A_Z \}$ is reached for only one value
$w_Z$ of $w$.

PROOF. Let $w_0 \in A_Z$. The set $A_Z \cap \{ w \in \mathbb{R}^d : \|w\| \le \|w_0\| \}$ is a compact convex
set and $w \mapsto \|w\|^2$ is strictly convex and therefore has a unique minimum on this
set, which is also obviously its minimum on $A_Z$. □

Definition 4.1.1. When $A_Z \neq \emptyset$, the training set $Z$ is said to be linearly
separable. The hyperplane

\[ H = \{ x \in \mathbb{R}^d : \langle w_Z, x \rangle - b_Z = 0 \}, \]

where

\[ w_Z = \arg\min \{ \|w\| : w \in A_Z \}, \qquad b_Z = b_{w_Z}, \]

is called the canonical separating hyperplane of the training set $Z$. The quantity
$\|w_Z\|^{-1}$ is called the margin of the canonical hyperplane.

As $\min_{i \in I_+} \langle w_Z, x_i \rangle - \max_{i \in I_-} \langle w_Z, x_i \rangle = 2$, the margin is also equal to half the
distance between the projections on the direction $w_Z$ of the positive and negative
patterns.
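
The definitions above can be checked numerically on a small example. The following sketch, assuming a hypothetical toy training set in $\mathbb{R}^2$ and helper names of our own choosing (none of them come from the text), computes the optimal threshold $b_w$ for a given direction $w$ and tests whether $w$ belongs to $A_Z$.

```python
from typing import Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Euclidean inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def projection_range(w: Vector, xs: Sequence[Vector], ys: Sequence[int]) -> Tuple[float, float]:
    """Return (max of <w, x_i> over negative patterns, min over positive patterns)."""
    neg_max = max(dot(w, x) for x, y in zip(xs, ys) if y == -1)
    pos_min = min(dot(w, x) for x, y in zip(xs, ys) if y == +1)
    return neg_max, pos_min


def optimal_threshold(w: Vector, xs: Sequence[Vector], ys: Sequence[int]) -> float:
    """Threshold b_w: midpoint between the two extreme projections."""
    neg_max, pos_min = projection_range(w, xs, ys)
    return 0.5 * (neg_max + pos_min)


def is_admissible(w: Vector, xs: Sequence[Vector], ys: Sequence[int]) -> bool:
    """Membership in A_Z, using its second characterization (gap of at least 2)."""
    neg_max, pos_min = projection_range(w, xs, ys)
    return neg_max + 2.0 <= pos_min


def worst_functional_margin(w: Vector, b: float, xs: Sequence[Vector], ys: Sequence[int]) -> float:
    """inf over i of (<w, x_i> - b) y_i."""
    return min((dot(w, x) - b) * y for x, y in zip(xs, ys))


# Hypothetical toy training set (an assumption made for illustration only).
toy_xs = [(2.0, 0.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 2.0)]
toy_ys = [+1, +1, -1, -1]
toy_w = (1.0, 0.0)
toy_b = optimal_threshold(toy_w, toy_xs, toy_ys)
```

With this data the direction `toy_w` is admissible and the worst functional margin at `toy_b` equals 1, in agreement with the equivalence between the two descriptions of $A_Z$.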

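4.1.2. Computation of the canonical hyperplane

Let us consider the convex hulls $\mathcal{X}_+$ and $\mathcal{X}_-$ of the positive and negative patterns:

\[ \mathcal{X}_+ = \Bigl\{ \sum_{i \in I_+} \lambda_i x_i : (\lambda_i)_{i \in I_+} \in \mathbb{R}_+^{I_+}, \ \sum_{i \in I_+} \lambda_i = 1 \Bigr\}, \]

\[ \mathcal{X}_- = \Bigl\{ \sum_{i \in I_-} \lambda_i x_i : (\lambda_i)_{i \in I_-} \in \mathbb{R}_+^{I_-}, \ \sum_{i \in I_-} \lambda_i = 1 \Bigr\}. \]

Let us introduce the closed convex set

\[ V = \mathcal{X}_+ - \mathcal{X}_- = \{ x_+ - x_- : x_+ \in \mathcal{X}_+, \ x_- \in \mathcal{X}_- \}. \]

As $v \mapsto \|v\|^2$ is strictly convex, with compact lower level sets, there is a unique
vector $v^*$ such that

\[ \|v^*\|^2 = \inf \{ \|v\|^2 : v \in V \}. \]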



Lemma 4.1.2. The set $A_Z$ is non-empty (i.e. the training set $Z$ is linearly
separable) if and only if $v^* \neq 0$. In this case

\[ w_Z = \frac{2}{\|v^*\|^2} \, v^*, \]

and the margin of the canonical hyperplane is equal to $\frac{1}{2} \|v^*\|$.

This lemma proves that the distance between the convex hulls of the positive
and negative patterns is equal to twice the margin of the canonical hyperplane.

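PROOF. Let us assume first that $v^* = 0$, or equivalently that $\mathcal{X}_+ \cap \mathcal{X}_- \neq \emptyset$. For
any vector $w \in \mathbb{R}^d$,

\[ \min_{i \in I_+} \langle w, x_i \rangle = \min_{x \in \mathcal{X}_+} \langle w, x \rangle, \qquad \max_{i \in I_-} \langle w, x_i \rangle = \max_{x \in \mathcal{X}_-} \langle w, x \rangle, \]

so $\min_{i \in I_+} \langle w, x_i \rangle - \max_{i \in I_-} \langle w, x_i \rangle \le 0$, which shows that $w$ cannot be in $A_Z$ and
therefore that $A_Z$ is empty.

Let us assume now that $v^* \neq 0$, or equivalently that $\mathcal{X}_+ \cap \mathcal{X}_- = \emptyset$. Let us put
$w^* = 2 v^* / \|v^*\|^2$. Let us remark first that

\[ \min_{i \in I_+} \langle w^*, x_i \rangle - \max_{i \in I_-} \langle w^*, x_i \rangle = \inf_{x \in \mathcal{X}_+} \langle w^*, x \rangle - \sup_{x \in \mathcal{X}_-} \langle w^*, x \rangle \]
\[ = \inf_{x_+ \in \mathcal{X}_+, \, x_- \in \mathcal{X}_-} \langle w^*, x_+ - x_- \rangle = \frac{2}{\|v^*\|^2} \inf_{v \in V} \langle v^*, v \rangle. \]

Let us now prove that $\inf_{v \in V} \langle v^*, v \rangle = \|v^*\|^2$. Some arbitrary $v \in V$ being fixed,
consider the function

\[ \beta \mapsto \| \beta v + (1 - \beta) v^* \|^2 : [0, 1] \to \mathbb{R}. \]

By definition of $v^*$, it reaches its minimum value for $\beta = 0$, and therefore has
a non-negative derivative at this point. Computing this derivative, we find that
$\langle v - v^*, v^* \rangle \ge 0$, as claimed. We have proved that

\[ \min_{i \in I_+} \langle w^*, x_i \rangle - \max_{i \in I_-} \langle w^*, x_i \rangle = 2, \]

and therefore that $w^* \in A_Z$. On the other hand, any $w \in A_Z$ is such that

\[ 2 \le \min_{i \in I_+} \langle w, x_i \rangle - \max_{i \in I_-} \langle w, x_i \rangle = \inf_{v \in V} \langle w, v \rangle \le \|w\| \inf_{v \in V} \|v\| = \|w\| \, \|v^*\|. \]

This proves that $\|w^*\| = \inf \{ \|w\| : w \in A_Z \}$, and therefore that $w^* = w_Z$ as
claimed. □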
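
Lemma 4.1.2 suggests a direct, if naive, way to approximate $w_Z$: search for the minimum norm point $v^*$ of $V$ and rescale it. The sketch below uses a Gilbert-type (Frank-Wolfe with exact line search) iteration, which is our own choice of algorithm and not one discussed in the text; the function names and the toy data in the usage note are hypothetical. It only returns an approximation of $v^*$ when the iteration budget is exhausted.

```python
from typing import List, Optional, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Euclidean inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def min_norm_difference(pos: Sequence[Vector], neg: Sequence[Vector],
                        max_iter: int = 10_000, tol: float = 1e-12) -> List[float]:
    """Approximate the minimum norm point v* of V = conv(pos) - conv(neg)."""
    v = [p - q for p, q in zip(pos[0], neg[0])]  # a vertex of V as starting point
    for _ in range(max_iter):
        # Vertex s of V minimizing <v, s>: smallest projection among positives,
        # largest projection among negatives.
        p = min(pos, key=lambda x: dot(v, x))
        q = max(neg, key=lambda x: dot(v, x))
        s = [a - b for a, b in zip(p, q)]
        gap = dot(v, v) - dot(v, s)  # vanishes exactly when <v, v' - v> >= 0 on V
        if gap <= tol:
            break
        d = [a - b for a, b in zip(s, v)]
        t = min(1.0, gap / dot(d, d))  # exact line search on the segment [v, s]
        v = [a + t * b for a, b in zip(v, d)]
    return v


def canonical_direction(pos: Sequence[Vector], neg: Sequence[Vector]) -> Optional[List[float]]:
    """Return w_Z = 2 v* / ||v*||^2, or None when v* = 0 (non-separable case)."""
    v = min_norm_difference(pos, neg)
    sq_norm = dot(v, v)
    if sq_norm <= 1e-18:
        return None
    return [2.0 * c / sq_norm for c in v]
```

For instance, with positive patterns `[(2.0, 1.0), (2.0, 0.0)]` and negative patterns `[(0.0, 0.0), (0.0, 1.0)]`, the convex hulls are at distance 2, so the margin is 1 and the canonical direction is the first basis vector.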

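One way to compute $w_Z$ would therefore be to compute $v^*$ by minimizing

\[ \Bigl\{ \Bigl\| \sum_{i \in I} \lambda_i y_i x_i \Bigr\|^2 : (\lambda_i)_{i \in I} \in \mathbb{R}_+^I, \ \sum_{i \in I} \lambda_i = 2, \ \sum_{i \in I} y_i \lambda_i = 0 \Bigr\}. \]

Although this is a tractable quadratic programming problem, a direct computation
of $w_Z$ through the following proposition is usually preferred.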

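Proposition 4.1.3. The canonical direction $w_Z$ can be expressed as

\[ w_Z = \sum_{i=1}^N \alpha_i^* y_i x_i, \]

where $(\alpha_i^*)_{i=1}^N$ is obtained by minimizing

\[ \inf \{ F(\alpha) : \alpha \in \mathcal{A} \}, \]

where

\[ \mathcal{A} = \Bigl\{ (\alpha_i)_{i \in I} \in \mathbb{R}_+^I : \sum_{i \in I} y_i \alpha_i = 0 \Bigr\}, \]

and

\[ F(\alpha) = \Bigl\| \sum_{i \in I} \alpha_i y_i x_i \Bigr\|^2 - 2 \sum_{i \in I} \alpha_i. \]

Proof. Let $w(\alpha) = \sum_{i \in I} \alpha_i y_i x_i$ and let $S(\alpha) = \frac{1}{2} \sum_{i \in I} \alpha_i$. We can express
the function $F(\alpha)$ as $F(\alpha) = \|w(\alpha)\|^2 - 4 S(\alpha)$. Moreover it is important to
notice that for any $s \in \mathbb{R}_+$, $\{ w(\alpha) : \alpha \in \mathcal{A}, S(\alpha) = s \} = s V$. This shows that
for any $s \in \mathbb{R}_+$, $\inf \{ F(\alpha) : \alpha \in \mathcal{A}, S(\alpha) = s \}$ is reached and that for any
$\alpha_s \in \{ \alpha \in \mathcal{A} : S(\alpha) = s \}$ reaching this infimum, $w(\alpha_s) = s v^*$. As
$s \mapsto s^2 \|v^*\|^2 - 4 s : \mathbb{R}_+ \to \mathbb{R}$ reaches its infimum for only one value $s^*$ of $s$, namely
at $s^* = 2 / \|v^*\|^2$, this shows that $F(\alpha)$ reaches its infimum on $\mathcal{A}$, and that for any
$\alpha^* \in \mathcal{A}$ such that $F(\alpha^*) = \inf \{ F(\alpha) : \alpha \in \mathcal{A} \}$, $w(\alpha^*) = \frac{2}{\|v^*\|^2} v^* = w_Z$. □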

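4.1.3. Support vectors

Definition 4.1.2. The set of support vectors $\mathcal{S}$ is defined by

\[ \mathcal{S} = \{ x_i : \langle w_Z, x_i \rangle - b_Z = y_i \}. \]

Proposition 4.1.4. Any $\alpha^*$ minimizing $F(\alpha)$ on $\mathcal{A}$ is such that

\[ \{ x_i : \alpha_i^* > 0 \} \subset \mathcal{S}. \]

This implies that the representation $w_Z = w(\alpha^*)$ involves in general only a limited
number of non-zero coefficients and that $w_Z = w_{Z'}$, where $Z' = \{ (x_i, y_i) : x_i \in \mathcal{S} \}$.

Proof. Let us consider any given $i \in I_+$ and $j \in I_-$ such that $\alpha_i^* > 0$ and $\alpha_j^* > 0$.
There exists at least one such index in each set $I_-$ and $I_+$, since the sums of the
components of $\alpha^*$ on each of these sets are equal and since $\sum_{k \in I} \alpha_k^* > 0$. For any
$t \in \mathbb{R}$, consider

\[ \alpha_k(t) = \alpha_k^* + t \, \mathbb{1}(k \in \{i, j\}), \qquad k \in I. \]

The vector $\alpha(t)$ is in $\mathcal{A}$ for any value of $t$ in some neighbourhood of 0, therefore
$\frac{\partial}{\partial t}\big|_{t=0} F[\alpha(t)] = 0$. Computing this derivative, we find that

\[ y_i \langle w(\alpha^*), x_i \rangle + y_j \langle w(\alpha^*), x_j \rangle = 2. \]

As $y_i = -y_j$, this can also be written as

\[ y_i \bigl[ \langle w(\alpha^*), x_i \rangle - b_Z \bigr] + y_j \bigl[ \langle w(\alpha^*), x_j \rangle - b_Z \bigr] = 2. \]

As $w(\alpha^*) \in A_Z$,

\[ y_k \bigl[ \langle w(\alpha^*), x_k \rangle - b_Z \bigr] \ge 1, \qquad k \in I, \]

which implies necessarily as claimed that

\[ y_i \bigl[ \langle w(\alpha^*), x_i \rangle - b_Z \bigr] = y_j \bigl[ \langle w(\alpha^*), x_j \rangle - b_Z \bigr] = 1. \]

□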

4.1.4. The non-separable case

In the case when the training set $Z = (x_i, y_i)_{i=1}^N$ is not linearly separable, we can
define a noisy canonical hyperplane as follows: we can choose $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ to
minimize

\[ C(w, b) = \sum_{i=1}^N \bigl[ 1 - ( \langle w, x_i \rangle - b ) y_i \bigr]_+ + \frac{1}{2} \|w\|^2, \tag{4.1} \]

where for any real number $r$, $r_+ = \max\{r, 0\}$ is the positive part of $r$.
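
As an illustration of criterion (4.1), here is a minimal sketch that evaluates $C(w, b)$ on a hypothetical one-dimensional-like toy sample embedded in $\mathbb{R}^2$; the function name `hinge_criterion` and the data are assumptions made for this example only.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Euclidean inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def hinge_criterion(w: Vector, b: float, xs: Sequence[Vector], ys: Sequence[int]) -> float:
    """C(w, b) = sum_i [1 - (<w, x_i> - b) y_i]_+ + ||w||^2 / 2, as in (4.1)."""
    slack = sum(max(0.0, 1.0 - (dot(w, x) - b) * y) for x, y in zip(xs, ys))
    return slack + 0.5 * dot(w, w)


# Hypothetical sample: the third pattern lies on the wrong side, so the
# training set is not linearly separable along the first axis.
noisy_xs = [(1.0, 0.0), (-1.0, 0.0), (-0.5, 0.0)]
noisy_ys = [+1, -1, +1]

value_at_unit_direction = hinge_criterion((1.0, 0.0), 0.0, noisy_xs, noisy_ys)
value_at_zero = hinge_criterion((0.0, 0.0), 0.0, noisy_xs, noisy_ys)
```

Here the first two patterns contribute no slack for $w = (1, 0)$, $b = 0$, while the third contributes $1.5$, so the criterion trades this slack against the penalty $\frac{1}{2}\|w\|^2$.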
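Theorem 4.1.5. Let us introduce the dual criterion

\[ F(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \Bigl\| \sum_{i=1}^N y_i \alpha_i x_i \Bigr\|^2 \]

and the domain $\mathcal{A}' = \bigl\{ \alpha \in \mathbb{R}_+^N : \alpha_i \le 1, \ i = 1, \dots, N, \ \sum_{i=1}^N y_i \alpha_i = 0 \bigr\}$. Let $\alpha^* \in \mathcal{A}'$
be such that $F(\alpha^*) = \sup_{\alpha \in \mathcal{A}'} F(\alpha)$. Let $w^* = \sum_{i=1}^N y_i \alpha_i^* x_i$. There is a threshold
$b^*$ (whose construction will be detailed in the proof), such that

\[ C(w^*, b^*) = \inf_{w \in \mathbb{R}^d, \, b \in \mathbb{R}} C(w, b). \]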

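Corollary 4.1.6. (scaled criterion) For any positive real parameter $\lambda$ let us
consider the criterion

\[ C_\lambda(w, b) = \lambda^2 \sum_{i=1}^N \bigl[ 1 - ( \langle w, x_i \rangle - b ) y_i \bigr]_+ + \frac{1}{2} \|w\|^2 \]

and the domain

\[ \mathcal{A}'_\lambda = \Bigl\{ \alpha \in \mathbb{R}_+^N : \alpha_i \le \lambda^2, \ i = 1, \dots, N, \ \sum_{i=1}^N y_i \alpha_i = 0 \Bigr\}. \]

For any solution $\alpha^*$ of the maximization problem $F(\alpha^*) = \sup_{\alpha \in \mathcal{A}'_\lambda} F(\alpha)$, the vector
$w^* = \sum_{i=1}^N y_i \alpha_i^* x_i$ is such that

\[ \inf_{b \in \mathbb{R}} C_\lambda(w^*, b) = \inf_{w \in \mathbb{R}^d, \, b \in \mathbb{R}} C_\lambda(w, b). \]

In the separable case, the scaled criterion is minimized by the canonical
hyperplane for $\lambda$ large enough. This extension of the canonical hyperplane computation
in dual space is often called the box constraint, for obvious reasons.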

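PROOF. The corollary is a straightforward consequence of the scale property
$C_\lambda(w, b, x) = \lambda^2 C(\lambda^{-1} w, b, \lambda x)$, where we have made the dependence of the
criterion in $x \in \mathbb{R}^{dN}$ explicit. Let us come now to the proof of the theorem.

The minimization of $C(w, b)$ can be performed in dual space extending the couple
of parameters $(w, b)$ to $\bar{w} = (w, b, \gamma) \in \mathbb{R}^d \times \mathbb{R} \times \mathbb{R}_+^N$ and introducing the dual
multipliers $\alpha \in \mathbb{R}_+^N$ and the criterion

\[ G(\alpha, \bar{w}) = \sum_{i=1}^N \gamma_i + \sum_{i=1}^N \alpha_i \Bigl\{ \bigl[ 1 - ( \langle w, x_i \rangle - b ) y_i \bigr] - \gamma_i \Bigr\} + \frac{1}{2} \|w\|^2. \]

We see that

\[ C(w, b) = \inf_{\gamma \in \mathbb{R}_+^N} \sup_{\alpha \in \mathbb{R}_+^N} G\bigl[ \alpha, (w, b, \gamma) \bigr], \]

and therefore, putting $W = \{ (w, b, \gamma) : w \in \mathbb{R}^d, b \in \mathbb{R}, \gamma \in \mathbb{R}_+^N \}$, we are led to solve
the minimization problem

\[ G(\alpha^*, \bar{w}^*) = \inf_{\bar{w} \in W} \sup_{\alpha \in \mathbb{R}_+^N} G(\alpha, \bar{w}), \]

whose solution $\bar{w}^* = (w^*, b^*, \gamma^*)$ is such that $C(w^*, b^*) = \inf_{(w, b) \in \mathbb{R}^{d+1}} C(w, b)$,
according to the preceding identity. As for any value of $\alpha' \in \mathbb{R}_+^N$,

\[ \inf_{\bar{w} \in W} \sup_{\alpha \in \mathbb{R}_+^N} G(\alpha, \bar{w}) \ge \inf_{\bar{w} \in W} G(\alpha', \bar{w}), \]

it is immediately seen that

\[ \inf_{\bar{w} \in W} \sup_{\alpha \in \mathbb{R}_+^N} G(\alpha, \bar{w}) \ge \sup_{\alpha \in \mathbb{R}_+^N} \inf_{\bar{w} \in W} G(\alpha, \bar{w}). \]

We are going to show that there is no duality gap, meaning that this inequality is
indeed an equality. More importantly we will do so by exhibiting a saddle point,
which, solving the dual minimization problem, will also solve the original one.

Let us first make explicit the solution of the dual problem (the interest of this
dual problem precisely lies in the fact that it can more easily be solved explicitly).
Introducing the admissible set of values of $\alpha$,

\[ \mathcal{A}' = \Bigl\{ \alpha \in \mathbb{R}^N : 0 \le \alpha_i \le 1, \ i = 1, \dots, N, \ \sum_{i=1}^N y_i \alpha_i = 0 \Bigr\}, \]

it is elementary to check that

\[ \inf_{\bar{w} \in W} G(\alpha, \bar{w}) = \begin{cases} \inf_{w \in \mathbb{R}^d} G\bigl[ \alpha, (w, 0, 0) \bigr], & \alpha \in \mathcal{A}', \\ -\infty, & \text{otherwise.} \end{cases} \]

As

\[ G\bigl[ \alpha, (w, 0, 0) \bigr] = \frac{1}{2} \|w\|^2 + \sum_{i=1}^N \alpha_i \bigl( 1 - \langle w, x_i \rangle y_i \bigr), \]

we see that $\inf_{w \in \mathbb{R}^d} G[\alpha, (w, 0, 0)]$ is reached at

\[ w_\alpha = \sum_{i=1}^N y_i \alpha_i x_i. \]

This proves that

\[ \inf_{\bar{w} \in W} G(\alpha, \bar{w}) = F(\alpha), \qquad \alpha \in \mathcal{A}'. \]

The continuous map $\alpha \mapsto \inf_{\bar{w} \in W} G(\alpha, \bar{w})$ reaches a maximum $\alpha^*$, not necessarily
unique, on the compact convex set $\mathcal{A}'$. We are now going to exhibit a choice of
$\bar{w}^* \in W$ such that $(\alpha^*, \bar{w}^*)$ is a saddle point. This means that we are going to show
that

\[ G(\alpha^*, \bar{w}^*) = \inf_{\bar{w} \in W} G(\alpha^*, \bar{w}) = \sup_{\alpha \in \mathbb{R}_+^N} G(\alpha, \bar{w}^*). \]

It will imply that

\[ \inf_{\bar{w} \in W} \sup_{\alpha \in \mathbb{R}_+^N} G(\alpha, \bar{w}) \le \sup_{\alpha \in \mathbb{R}_+^N} G(\alpha, \bar{w}^*) = G(\alpha^*, \bar{w}^*) \]

on the one hand and that

\[ \inf_{\bar{w} \in W} \sup_{\alpha \in \mathbb{R}_+^N} G(\alpha, \bar{w}) \ge \inf_{\bar{w} \in W} G(\alpha^*, \bar{w}) = G(\alpha^*, \bar{w}^*) \]

on the other hand, proving that

\[ G(\alpha^*, \bar{w}^*) = \inf_{\bar{w} \in W} \sup_{\alpha \in \mathbb{R}_+^N} G(\alpha, \bar{w}) \]

as required.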

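Construction of $\bar{w}^*$.

• Let us put $w^* = w_{\alpha^*}$.

• If there is $j \in \{1, \dots, N\}$ such that $0 < \alpha_j^* < 1$, let us put

\[ b^* = \langle x_j, w^* \rangle - y_j. \]

Otherwise, let us put

\[ b^* = \sup \{ \langle x_i, w^* \rangle - 1 : \alpha_i^* > 0, \ y_i = +1, \ i = 1, \dots, N \}. \]

• Let us then put

\[ \gamma_i^* = \begin{cases} 0, & \alpha_i^* < 1, \\ 1 - ( \langle w^*, x_i \rangle - b^* ) y_i, & \alpha_i^* = 1. \end{cases} \]

If we can prove that

\[ 1 - ( \langle w^*, x_i \rangle - b^* ) y_i \ \begin{cases} \le 0, & \alpha_i^* = 0, \\ = 0, & 0 < \alpha_i^* < 1, \\ \ge 0, & \alpha_i^* = 1, \end{cases} \tag{4.2} \]

it will show that $\gamma^* \in \mathbb{R}_+^N$ and therefore that $\bar{w}^* = (w^*, b^*, \gamma^*) \in W$. It will also
show that

\[ G(\alpha, \bar{w}^*) = \sum_{i=1}^N \gamma_i^* + \sum_{i : \alpha_i^* = 0} \alpha_i \bigl[ 1 - ( \langle w^*, x_i \rangle - b^* ) y_i \bigr] + \frac{1}{2} \|w^*\|^2, \]

proving that $G(\alpha^*, \bar{w}^*) = \sup_{\alpha \in \mathbb{R}_+^N} G(\alpha, \bar{w}^*)$. As obviously $G(\alpha^*, \bar{w}^*) = G[\alpha^*, (w^*, 0, 0)]$,
we already know that $G(\alpha^*, \bar{w}^*) = \inf_{\bar{w} \in W} G(\alpha^*, \bar{w})$. This will show that
$(\alpha^*, \bar{w}^*)$ is the saddle point we were looking for, thus ending the proof of the
theorem. □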
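
The construction of $(w^*, b^*, \gamma^*)$ from a dual solution can be sketched in a few lines. The code below assumes that a maximizer $\alpha^*$ of the dual criterion on the box domain is already available (it does not solve the quadratic program), that both classes are present, and it uses hypothetical function names and a numerical tolerance `tol` of our own choosing to decide whether a coefficient is strictly inside $(0, 1)$.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Euclidean inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def direction(alpha: Sequence[float], xs: Sequence[Vector], ys: Sequence[int]) -> List[float]:
    """w_alpha = sum_i y_i alpha_i x_i."""
    dim = len(xs[0])
    return [sum(y * a * x[k] for a, x, y in zip(alpha, xs, ys)) for k in range(dim)]


def dual_criterion(alpha: Sequence[float], xs: Sequence[Vector], ys: Sequence[int]) -> float:
    """F(alpha) = sum_i alpha_i - ||w_alpha||^2 / 2 (criterion of Theorem 4.1.5)."""
    w = direction(alpha, xs, ys)
    return sum(alpha) - 0.5 * dot(w, w)


def hinge_criterion(w: Vector, b: float, xs: Sequence[Vector], ys: Sequence[int]) -> float:
    """Primal criterion C(w, b) of equation (4.1)."""
    slack = sum(max(0.0, 1.0 - (dot(w, x) - b) * y) for x, y in zip(xs, ys))
    return slack + 0.5 * dot(w, w)


def primal_from_dual(alpha: Sequence[float], xs: Sequence[Vector], ys: Sequence[int],
                     tol: float = 1e-9) -> Tuple[List[float], float, List[float]]:
    """Build (w*, b*, gamma*) from an optimal alpha*, following the proof."""
    w = direction(alpha, xs, ys)
    interior = [j for j, a in enumerate(alpha) if tol < a < 1.0 - tol]
    if interior:
        j = interior[0]
        b = dot(xs[j], w) - ys[j]
    else:
        # Assumes at least one positive pattern has a non-zero coefficient.
        b = max(dot(x, w) - 1.0 for a, x, y in zip(alpha, xs, ys) if a > tol and y == +1)
    gamma = [1.0 - (dot(w, x) - b) * y if a >= 1.0 - tol else 0.0
             for a, x, y in zip(alpha, xs, ys)]
    return w, b, gamma


# Hypothetical separable pair of points; alpha* = (1/2, 1/2) maximizes F here.
pair_xs = [(1.0, 0.0), (-1.0, 0.0)]
pair_ys = [+1, -1]
pair_alpha = [0.5, 0.5]
pair_w, pair_b, pair_gamma = primal_from_dual(pair_alpha, pair_xs, pair_ys)
```

On this example the primal value `hinge_criterion(pair_w, pair_b, pair_xs, pair_ys)` and the dual value `dual_criterion(pair_alpha, pair_xs, pair_ys)` both equal $1/2$, which is the absence of duality gap stated by the theorem.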

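PROOF of equation (4.2). Let us deal first with the case when there is $j \in
\{1, \dots, N\}$ such that $0 < \alpha_j^* < 1$.

For any $i \in \{1, \dots, N\}$ such that $0 < \alpha_i^* < 1$, there is $\epsilon > 0$ such that for any
$t \in (-\epsilon, \epsilon)$, $\alpha^* + t y_i e_i - t y_j e_j \in \mathcal{A}'$, where $(e_k)_{k=1}^N$ is the canonical base of $\mathbb{R}^N$. Thus
$\frac{\partial}{\partial t}\big|_{t=0} F(\alpha^* + t y_i e_i - t y_j e_j) = 0$. Computing this derivative, we obtain

\[ \frac{\partial}{\partial t}\Big|_{t=0} F(\alpha^* + t y_i e_i - t y_j e_j) = y_i - \langle w^*, x_i \rangle + \langle w^*, x_j \rangle - y_j = y_i \bigl[ 1 - ( \langle w^*, x_i \rangle - b^* ) y_i \bigr]. \]

Thus $1 - ( \langle w^*, x_i \rangle - b^* ) y_i = 0$, as required. This shows also that the definition of $b^*$
does not depend on the choice of $j$ such that $0 < \alpha_j^* < 1$.

For any $i \in \{1, \dots, N\}$ such that $\alpha_i^* = 0$, there is $\epsilon > 0$ such that for any
$t \in (0, \epsilon)$, $\alpha^* + t e_i - t y_i y_j e_j \in \mathcal{A}'$. Thus $\frac{\partial}{\partial t}\big|_{t=0} F(\alpha^* + t e_i - t y_i y_j e_j) \le 0$, showing
that $1 - ( \langle w^*, x_i \rangle - b^* ) y_i \le 0$ as required.

For any $i \in \{1, \dots, N\}$ such that $\alpha_i^* = 1$, there is $\epsilon > 0$ such that for any
$t \in (0, \epsilon)$, $\alpha^* - t e_i + t y_i y_j e_j \in \mathcal{A}'$. Thus $\frac{\partial}{\partial t}\big|_{t=0} F(\alpha^* - t e_i + t y_i y_j e_j) \le 0$, showing
that $1 - ( \langle w^*, x_i \rangle - b^* ) y_i \ge 0$ as required. This shows that $(\alpha^*, \bar{w}^*)$ is a saddle point
in this case.

Let us deal now with the case where $\alpha^* \in \{0, 1\}^N$. If we are not in the trivial case
where the vector $(y_i)_{i=1}^N$ is constant, the case $\alpha^* = 0$ is ruled out. Indeed, in this
case, considering $\alpha^* + t e_i + t e_j$, where $y_i y_j = -1$, we would get the contradiction

\[ 2 = \frac{\partial}{\partial t}\Big|_{t=0} F(\alpha^* + t e_i + t e_j) \le 0. \]

Thus there are values of $j$ such that $\alpha_j^* = 1$, and since $\sum_{i=1}^N \alpha_i^* y_i = 0$, both
classes are present in the set $\{ j : \alpha_j^* = 1 \}$.

Now for any $i, j \in \{1, \dots, N\}$ such that $\alpha_i^* = \alpha_j^* = 1$ and such that $y_i = +1$ and
$y_j = -1$, $\frac{\partial}{\partial t}\big|_{t=0} F(\alpha^* - t e_i - t e_j) = -2 + \langle w^*, x_i \rangle - \langle w^*, x_j \rangle \le 0$. Thus

\[ \sup \{ \langle w^*, x_i \rangle - 1 : \alpha_i^* = 1, \ y_i = +1 \} \le \inf \{ \langle w^*, x_j \rangle + 1 : \alpha_j^* = 1, \ y_j = -1 \}, \]

showing that

\[ 1 - ( \langle w^*, x_k \rangle - b^* ) y_k \ge 0, \qquad \alpha_k^* = 1. \]

Finally, for any $i$ such that $\alpha_i^* = 0$, for any $j$ such that $\alpha_j^* = 1$ and $y_j = y_i$, we have

\[ \frac{\partial}{\partial t}\Big|_{t=0} F(\alpha^* + t e_i - t e_j) = - y_i \langle w^*, x_i - x_j \rangle \le 0, \]

showing that $1 - ( \langle w^*, x_i \rangle - b^* ) y_i \le 0$. This shows that $(\alpha^*, \bar{w}^*)$ is always a saddle
point.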



4.1.5. Support Vector Machines

Definition 4.1.3. The symmetric measurable kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is said to be
positive (or more precisely positive semi-definite) if for any $n \in \mathbb{N}$, any $(x_i)_{i=1}^n \in \mathcal{X}^n$
and any $\alpha \in \mathbb{R}^n$,

\[ \sum_{i=1}^n \sum_{j=1}^n \alpha_i K(x_i, x_j) \alpha_j \ge 0. \]

Let $Z = (x_i, y_i)_{i=1}^N$ be some training set. Let us consider as previously

\[ \mathcal{A} = \Bigl\{ \alpha \in \mathbb{R}_+^N : \sum_{i=1}^N \alpha_i y_i = 0 \Bigr\}. \]

Let

\[ F(\alpha) = \sum_{i=1}^N \sum_{j=1}^N \alpha_i y_i K(x_i, x_j) y_j \alpha_j - 2 \sum_{i=1}^N \alpha_i. \]

Definition 4.1.4. Let $K$ be a positive symmetric kernel. The training set $Z$ is said
to be $K$-separable if

\[ \inf \{ F(\alpha) : \alpha \in \mathcal{A} \} > -\infty. \]

Lemma 4.1.7. When $Z$ is $K$-separable, $\inf \{ F(\alpha) : \alpha \in \mathcal{A} \}$ is reached.

Proof. Consider the training set $Z' = (x'_i, y_i)_{i=1}^N$, where

\[ x'_i = \Bigl\{ \Bigl[ \bigl( K(x_k, x_\ell) \bigr)_{k=1, \ell=1}^{N, N} \Bigr]^{1/2} (i, j) \Bigr\}_{j=1}^N \in \mathbb{R}^N. \]

We see that $F(\alpha) = \| \sum_{i=1}^N \alpha_i y_i x'_i \|^2 - 2 \sum_{i=1}^N \alpha_i$. We proved in the previous section
that $Z'$ is linearly separable if and only if $\inf \{ F(\alpha) : \alpha \in \mathcal{A} \} > -\infty$, and that the
infimum is reached in this case. □

Proposition 4.1.8. Let $K$ be a symmetric positive kernel and let $Z = (x_i, y_i)_{i=1}^N$
be some $K$-separable training set. Let $\alpha^* \in \mathcal{A}$ be such that $F(\alpha^*) = \inf \{ F(\alpha) : \alpha \in
\mathcal{A} \}$. Let

\[ I_-^* = \{ i \in \mathbb{N} : 1 \le i \le N, \ y_i = -1, \ \alpha_i^* > 0 \}, \]
\[ I_+^* = \{ i \in \mathbb{N} : 1 \le i \le N, \ y_i = +1, \ \alpha_i^* > 0 \}, \]
\[ b^* = \frac{1}{2} \Bigl\{ \sum_{j=1}^N \alpha_j^* y_j K(x_j, x_{i_-}) + \sum_{j=1}^N \alpha_j^* y_j K(x_j, x_{i_+}) \Bigr\}, \qquad i_- \in I_-^*, \ i_+ \in I_+^*, \]

where the value of $b^*$ does not depend on the choice of $i_-$ and $i_+$. The classification
rule $f : \mathcal{X} \to \mathcal{Y}$ defined by the formula

\[ f(x) = \operatorname{sign} \Bigl( \sum_{i=1}^N \alpha_i^* y_i K(x_i, x) - b^* \Bigr) \]

is independent of the choice of $\alpha^*$ and is called the support vector machine defined
by $K$ and $Z$. The set $\mathcal{S} = \{ x_j : \sum_{i=1}^N \alpha_i^* y_i K(x_i, x_j) - b^* = y_j \}$ is called the set of
support vectors. For any choice of $\alpha^*$, $\{ x_i : \alpha_i^* > 0 \} \subset \mathcal{S}$.

An important consequence of this proposition is that the support vector machine
defined by $K$ and $Z$ is also the support vector machine defined by $K$ and $Z' =
\{ (x_i, y_i) : \alpha_i^* > 0, \ 1 \le i \le N \}$, since this restriction of the index set contains the
value $\alpha^*$ where the minimum of $F$ is reached.

PROOF. The independence of the choice of $\alpha^*$, which is not necessarily unique,
is seen as follows. Let $(x_i)_{i=1}^N$ and $x \in \mathcal{X}$ be fixed. Let us put for ease of notation
$x_{N+1} = x$. Let $M$ be the $(N + 1) \times (N + 1)$ symmetric semi-definite matrix defined
by $M(i, j) = K(x_i, x_j)$, $i = 1, \dots, N + 1$, $j = 1, \dots, N + 1$. Let us consider the
mapping $\Psi : \{ x_i : i = 1, \dots, N + 1 \} \to \mathbb{R}^{N+1}$ defined by

\[ \Psi(x_i) = \bigl[ M^{1/2}(i, j) \bigr]_{j=1}^{N+1} \in \mathbb{R}^{N+1}. \tag{4.3} \]

Let us consider the training set $Z' = [\Psi(x_i), y_i]_{i=1}^N$. Then $Z'$ is linearly separable,

\[ F(\alpha) = \Bigl\| \sum_{i=1}^N \alpha_i y_i \Psi(x_i) \Bigr\|^2 - 2 \sum_{i=1}^N \alpha_i, \]

and we have proved that for any choice of $\alpha^* \in \mathcal{A}$ minimizing $F(\alpha)$,
$w_{Z'} = \sum_{i=1}^N \alpha_i^* y_i \Psi(x_i)$. Thus the support vector machine defined by $K$ and $Z$
can also be expressed by the formula

\[ f(x) = \operatorname{sign} \bigl[ \langle w_{Z'}, \Psi(x) \rangle - b_{Z'} \bigr], \]

which does not depend on $\alpha^*$. The definition of $\mathcal{S}$ is such that $\Psi(\mathcal{S})$ is the set of
support vectors defined in the linear case, where its stated property has already
been proved. □

We can in the same way use the box constraint and show that any solution
$\alpha^* \in \arg\min \{ F(\alpha) : \alpha \in \mathcal{A}, \ \alpha_i \le \lambda^2, \ i = 1, \dots, N \}$ minimizes

\[ \inf_{b \in \mathbb{R}} \lambda^2 \sum_{i=1}^N \Bigl[ 1 - \Bigl( \sum_{j=1}^N y_j \alpha_j K(x_j, x_i) - b \Bigr) y_i \Bigr]_+ + \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j). \tag{4.4} \]



4.1.6. Building kernels

Except the last, the results of this section are drawn from Cristianini et al. (2000).
We have no reference for the last proposition of this section, although we believe it
is well known. We include them for the convenience of the reader.

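Proposition 4.1.9. Let $K_1$ and $K_2$ be positive symmetric kernels on $\mathcal{X}$. Then for
any $a \in \mathbb{R}_+$

\[ (a K_1 + K_2)(x, x') = a K_1(x, x') + K_2(x, x') \]
\[ \text{and} \quad (K_1 \cdot K_2)(x, x') = K_1(x, x') K_2(x, x') \]

are also positive symmetric kernels. Moreover, for any measurable function
$g : \mathcal{X} \to \mathbb{R}$, $K_g(x, x') \overset{\text{def}}{=} g(x) g(x')$ is also a positive symmetric kernel.

Proof. It is enough to prove the proposition in the case when $\mathcal{X}$ is finite and
kernels are just ordinary symmetric matrices. Thus we can assume without loss of
generality that $\mathcal{X} = \{1, \dots, n\}$. Then for any $\alpha \in \mathbb{R}^n$, using usual matrix notation,

\[ \langle \alpha, (a K_1 + K_2) \alpha \rangle = a \langle \alpha, K_1 \alpha \rangle + \langle \alpha, K_2 \alpha \rangle \ge 0, \]

\[ \langle \alpha, (K_1 \cdot K_2) \alpha \rangle = \sum_{i,j} \alpha_i K_1(i, j) K_2(i, j) \alpha_j \]
\[ = \sum_{i,j,k} \alpha_i K_1^{1/2}(i, k) K_1^{1/2}(k, j) K_2(i, j) \alpha_j \]
\[ = \sum_k \underbrace{\sum_{i,j} \bigl[ K_1^{1/2}(k, i) \alpha_i \bigr] K_2(i, j) \bigl[ K_1^{1/2}(k, j) \alpha_j \bigr]}_{\ge 0} \ge 0, \]

\[ \langle \alpha, K_g \alpha \rangle = \sum_{i,j} \alpha_i g(i) g(j) \alpha_j = \Bigl( \sum_i \alpha_i g(i) \Bigr)^2 \ge 0. \]

□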



Proposition 4.1.10. Let $K$ be some positive symmetric kernel on $\mathcal{X}$. Let $p : \mathbb{R} \to
\mathbb{R}$ be a polynomial with positive coefficients. Let $g : \mathcal{X} \to \mathbb{R}^d$ be a measurable
function. Then

\[ p(K)(x, x') \overset{\text{def}}{=} p[K(x, x')], \]
\[ \exp(K)(x, x') \overset{\text{def}}{=} \exp[K(x, x')] \]
\[ \text{and} \quad G_g(x, x') \overset{\text{def}}{=} \exp\bigl( - \| g(x) - g(x') \|^2 \bigr) \]

are all positive symmetric kernels.

Proof. The first assertion is a direct consequence of the previous proposition. The
second comes from the fact that the exponential function is the pointwise limit of a
sequence of polynomial functions with positive coefficients. The third is seen from
the second and the decomposition

\[ G_g(x, x') = \bigl[ \exp( - \| g(x) \|^2 ) \exp( - \| g(x') \|^2 ) \bigr] \exp\bigl[ 2 \langle g(x), g(x') \rangle \bigr]. \]

□
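
The closure properties of the last two propositions translate directly into kernel combinators. The sketch below is only an illustration under our own naming assumptions (`scaled_sum`, `product`, `gaussian` and `quadratic_form` are hypothetical helpers): it builds new kernels from old ones and evaluates the quadratic form $\sum_{i,j} \alpha_i K(x_i, x_j) \alpha_j$, which should be non-negative up to rounding errors.

```python
import math
from typing import Callable, Sequence

Point = Sequence[float]
Kernel = Callable[[Point, Point], float]


def linear_kernel(x: Point, xp: Point) -> float:
    """Scalar product kernel, positive by construction."""
    return sum(a * b for a, b in zip(x, xp))


def scaled_sum(a: float, k1: Kernel, k2: Kernel) -> Kernel:
    """Kernel a K1 + K2, for a >= 0."""
    return lambda x, xp: a * k1(x, xp) + k2(x, xp)


def product(k1: Kernel, k2: Kernel) -> Kernel:
    """Pointwise product kernel K1 . K2."""
    return lambda x, xp: k1(x, xp) * k2(x, xp)


def gaussian(g: Callable[[Point], Point]) -> Kernel:
    """Kernel G_g(x, x') = exp(-||g(x) - g(x')||^2)."""
    return lambda x, xp: math.exp(-sum((u - v) ** 2 for u, v in zip(g(x), g(xp))))


def quadratic_form(kernel: Kernel, points: Sequence[Point], coeffs: Sequence[float]) -> float:
    """sum_{i,j} alpha_i K(x_i, x_j) alpha_j."""
    return sum(ci * kernel(xi, xj) * cj
               for ci, xi in zip(coeffs, points)
               for cj, xj in zip(coeffs, points))


# Hypothetical sample points and test coefficients.
sample_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
sample_coeffs = [(1.0, -1.0, 0.0), (1.0, 1.0, -2.0), (0.5, -2.0, 1.0)]
identity_gaussian = gaussian(lambda x: x)
combined = product(scaled_sum(2.0, linear_kernel, identity_gaussian), identity_gaussian)
```

Evaluating `quadratic_form(combined, sample_points, c)` for each `c` in `sample_coeffs` gives non-negative values. Such a finite check is of course not a proof of positivity, only a sanity test of the construction.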



Proposition 4.1.11. With the notation of the previous proposition, any training
set $Z = (x_i, y_i)_{i=1}^N \in (\mathcal{X} \times \{-1, +1\})^N$ is $G_g$-separable as soon as $g(x_i)$, $i = 1, \dots, N$
are distinct points of $\mathbb{R}^d$.

Proof. It is clearly enough to prove the case when $\mathcal{X} = \mathbb{R}^d$ and $g$ is the identity.
Let us consider some other generic point $x_{N+1} \in \mathbb{R}^d$ and define $\Psi$ as in (4.3). It is
enough to prove that $\Psi(x_1), \dots, \Psi(x_N)$ are affine independent, since the simplex,
and therefore any affine independent set of points, can be split in any arbitrary way
by affine half-spaces. Let us assume that $\Psi(x_1), \dots, \Psi(x_N)$ are affine dependent; then
for some $(\lambda_1, \dots, \lambda_N) \neq 0$ such that $\sum_{i=1}^N \lambda_i = 0$,

\[ \sum_{i=1}^N \sum_{j=1}^N \lambda_i G(x_i, x_j) \lambda_j = 0. \]

Thus, $(\lambda_i)_{i=1}^{N+1}$, where we have put $\lambda_{N+1} = 0$, is in the kernel of the symmetric
positive semi-definite matrix $G(x_i, x_j)_{i, j \in \{1, \dots, N+1\}}$. Therefore

\[ \sum_{i=1}^N \lambda_i G(x_i, x_{N+1}) = 0, \]

for any $x_{N+1} \in \mathbb{R}^d$. This would mean that the functions $x \mapsto \exp( - \| x - x_i \|^2 )$
are linearly dependent, which can be easily proved to be false. Indeed, let $n \in \mathbb{R}^d$
be such that $\|n\| = 1$ and $\langle n, x_i \rangle$, $i = 1, \dots, N$ are distinct (such a vector exists,
because it has to be outside the union of a finite number of hyperplanes, which is
of zero Lebesgue measure on the sphere). Let us assume for a while that for some
$(\lambda_i)_{i=1}^N \in \mathbb{R}^N$, for any $x \in \mathbb{R}^d$,

\[ \sum_{i=1}^N \lambda_i \exp( - \| x - x_i \|^2 ) = 0. \]

Considering $x = t n$, for $t \in \mathbb{R}$, we would get

\[ \sum_{i=1}^N \lambda_i \exp\bigl( 2 t \langle n, x_i \rangle - \| x_i \|^2 \bigr) = 0, \qquad t \in \mathbb{R}. \]

Letting $t$ go to infinity, we see that this is only possible if $\lambda_i = 0$ for all values of $i$.
□
4.2. Bounds for Support Vector Machines

4.2.1. Compression scheme bounds

We can use Support Vector Machines in the framework of compression schemes
and apply Theorem 3.3.3 (page 125). More precisely, given some positive symmetric
kernel $K$ on $\mathcal{X}$, we may consider for any training set $Z' = (x'_i, y'_i)_{i=1}^h$ the classifier
$\hat{f}_{Z'} : \mathcal{X} \to \mathcal{Y}$ which is equal to the Support Vector Machine defined by $K$ and $Z'$
whenever $Z'$ is $K$-separable, and which is equal to some constant classification rule
otherwise; we take this convention to stick to the framework described on page 117;
we will only use $\hat{f}_{Z'}$ in the $K$-separable case, so this extension of the definition is just
a matter of presentation. In the application of Theorem 3.3.3 in the case when the
observed sample $(X_i, Y_i)_{i=1}^N$ is $K$-separable, a natural if perhaps sub-optimal choice
of $Z'$ is to choose for $(x'_i)$ the set of support vectors defined by $Z = (X_i, Y_i)_{i=1}^N$ and
to choose for $(y'_i)$ the corresponding values of $Y$. This is justified by the fact that
$\hat{f}_Z = \hat{f}_{Z'}$, as shown in Proposition 4.1.8 (page 139). If $Z$ is not $K$-separable, we
can train a Support Vector Machine with the box constraint, then remove all the
errors to obtain a $K$-separable sub-sample $Z' = \{ (X_i, Y_i) : \alpha_i^* < \lambda^2, \ 1 \le i \le N \}$,
using the same notation as in equation (4.4) on page 140, and then consider its
support vectors as the compression set. Still using the notation of page 140, this
means we have to compute successively $\alpha^* \in \arg\min \{ F(\alpha) : \alpha \in \mathcal{A}, \ \alpha_i \le \lambda^2 \}$, and
$\alpha^{**} \in \arg\min \{ F(\alpha) : \alpha \in \mathcal{A}, \ \alpha_i = 0 \text{ when } \alpha_i^* = \lambda^2 \}$, to keep the compression set
indexed by $J = \{ i : 1 \le i \le N, \ \alpha_i^{**} > 0 \}$, and the corresponding Support Vector
Machine $\hat{f}_J$. Different values of $\lambda$ can be used at this stage, producing different
candidate compression sets: when $\lambda$ increases, the number of errors should decrease;
on the other hand, when $\lambda$ decreases, the margin of the separable subset $Z'$
increases, supporting the hope for a smaller set of support vectors. Thus we can use $\lambda$
to monitor the number of errors on the training set we accept from the compression
scheme. As we can use whatever heuristic we want while selecting the compression
set, we can also try to threshold in the previous construction $\alpha^{**}$ at different levels
$\eta \ge 0$, to produce candidate compression sets $J_\eta = \{ i : 1 \le i \le N, \ \alpha_i^{**} > \eta \}$ of
various sizes.

As the size $|J|$ of the compression set is random in this construction, we must
use a version of Theorem 3.3.3 (page 125) which handles compression sets of
arbitrary sizes. This is done by choosing for each $k$ a $k$-partially exchangeable posterior
distribution $\pi_k$ which weights the compression sets of all dimensions. We
immediately see that we can choose $\pi_k$ such that

\[ - \log\bigl[ \pi_k( \Delta_k(J) ) \bigr] \le \log\bigl[ |J| ( |J| + 1 ) \bigr] + |J| \log\Bigl( \frac{(k + 1) e N}{|J|} \Bigr). \]
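
To get a feeling for the size of this complexity term, here is a small sketch computing the right-hand side of the inequality above. The function names are hypothetical; the comparison with the exact number of subsets relies on the standard estimate $\binom{m}{j} \le (e m / j)^j$, applied with $m = (k + 1) N$, together with the weights $1 / [j (j + 1)]$, which sum to one over the sizes $j \ge 1$.

```python
import math


def compression_complexity(size: int, k: int, n: int) -> float:
    """log[|J|(|J| + 1)] + |J| log((k + 1) e N / |J|) for a compression set of given size."""
    return math.log(size * (size + 1)) + size * math.log((k + 1) * math.e * n / size)


def log_number_of_subsets(size: int, k: int, n: int) -> float:
    """Exact log of the number of subsets of the given size among (k + 1) N indices."""
    return math.log(math.comb((k + 1) * n, size))


# Illustration with the values N = 1000 and k = 15 of the numerical example
# of the previous chapter, and a hypothetical compression set of size 20.
example_term = compression_complexity(20, 15, 1000)
example_exact = log_number_of_subsets(20, 15, 1000) + math.log(20 * 21)
```

The gap between `example_term` and `example_exact` shows how much is lost by using the closed form estimate instead of the exact count.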

If we observe the shadow sample patterns, and if computer resources permit, we
can of course use more elaborate bounds than Theorem 3.3.3, such as the
transductive equivalent for Theorem 1.3.15 (page 31) (where we may consider the
submodels made of all the compression sets of the same size). Theorems based on relative
bounds, such as Theorem 2.2.4 (page 73) or Theorem 2.3.9 (page 108) can also be
used. Gibbs distributions can be approximated by Monte Carlo techniques, where
a Markov chain with the proper invariant measure consists in appropriate local
perturbations of the compression set.

Let us mention also that the use of compression schemes based on Support Vector
Machines can be tailored to perform some kind of feature aggregation. Imagine that
the kernel $K$ is defined as the scalar product in $L_2(\pi)$, where $\pi \in \mathcal{M}_+^1(\Theta)$. More
precisely let us consider for some set of soft classification rules $\{ f_\theta : \mathcal{X} \to \mathbb{R} ; \theta \in \Theta \}$
the kernel

\[ K(x, x') = \int_{\theta \in \Theta} f_\theta(x) f_\theta(x') \, \pi(d\theta). \]

In this setting, the Support Vector Machine applied to the training set $Z = (x_i,
y_i)_{i=1}^N$ has the form

\[ \hat{f}_Z(x) = \operatorname{sign} \Bigl( \int_{\theta \in \Theta} f_\theta(x) \sum_{i=1}^N y_i \alpha_i f_\theta(x_i) \, \pi(d\theta) - b \Bigr) \]

and, if this is too burdensome to compute, we can replace it with some finite
approximation

\[ \tilde{f}_Z(x) = \operatorname{sign} \Bigl( \frac{1}{m} \sum_{k=1}^m f_{\theta_k}(x) w_k - b \Bigr), \]

where the set $\{ \theta_k, k = 1, \dots, m \}$ and the weights $\{ w_k, k = 1, \dots, m \}$ are computed
in some suitable way from the set $Z' = (x_i, y_i)_{i, \alpha_i > 0}$ of support vectors of $\hat{f}_Z$. For
instance, we can draw $\{ \theta_k, k = 1, \dots, m \}$ at random according to the probability
distribution proportional to

\[ \Bigl| \sum_{i=1}^N y_i \alpha_i f_\theta(x_i) \Bigr| \, \pi(d\theta), \]

define the weights $w_k$ by

\[ w_k = \operatorname{sign} \Bigl( \sum_{i=1}^N y_i \alpha_i f_{\theta_k}(x_i) \Bigr) \int_{\theta \in \Theta} \Bigl| \sum_{i=1}^N y_i \alpha_i f_\theta(x_i) \Bigr| \, \pi(d\theta), \]

and choose the smallest value of $m$ for which this approximation still classifies $Z'$
without errors. Let us remark that we have built $\tilde{f}_Z$ in such a way that

\[ \lim_{m \to +\infty} \tilde{f}_Z(x_i) = \hat{f}_Z(x_i) = y_i, \quad \text{a.s.} \]

for any support index $i$ such that $\alpha_i > 0$.

Alternatively, given $Z'$, we can select a finite set of features $\Theta' \subset \Theta$ such that $Z'$
is $K_{\Theta'}$-separable, where $K_{\Theta'}(x, x') = \sum_{\theta \in \Theta'} f_\theta(x) f_\theta(x')$, and consider the Support
Vector Machine $\hat{f}_{Z'}$ built with the kernel $K_{\Theta'}$. As soon as $\Theta'$ is chosen as a function
of $Z'$ only, Theorem 3.3.3 (page 125) applies and provides some level of confidence
for the risk of $\hat{f}_{Z'}$.



4.2.2. The Vapnik-Cervonenkis dimension of a family of subsets

Let us consider some set $X$ and some set $S \subset \{0, 1\}^X$ of subsets of $X$. Let $h(S)$ be
the Vapnik-Cervonenkis dimension of $S$, defined as

\[ h(S) = \max \bigl\{ |A| : A \subset X, \ |A| < \infty \text{ and } A \cap S = \{0, 1\}^A \bigr\}, \]

where by definition $A \cap S = \{ A \cap B : B \in S \}$ and $|A|$ is the number of points in $A$.
Let us notice that this definition does not depend on the choice of the reference set
$X$. Indeed $X$ can be chosen to be $\bigcup S$, the union of all the sets in $S$, or any bigger
set. Let us notice also that for any set $B$, $h(B \cap S) \le h(S)$, the reason being that
$A \cap (B \cap S) = B \cap (A \cap S)$.

This notion of Vapnik-Cervonenkis dimension is useful because, as we will see
for Support Vector Machines, it can be computed in some important special cases.
Let us prove here as an illustration that $h(S) = d + 1$ when $X = \mathbb{R}^d$ and $S$ is made
of all the half spaces:

\[ S = \{ A_{w,b} : w \in \mathbb{R}^d, b \in \mathbb{R} \}, \quad \text{where } A_{w,b} = \{ x \in X : \langle w, x \rangle \ge b \}. \]



Proposition 4.2.1. With the previous notation, $h(S) = d + 1$.

PROOF. Let $(e_i)_{i=1}^{d+1}$ be the canonical base of $\mathbb{R}^{d+1}$, and let $X$ be the affine subspace
it generates, which can be identified with $\mathbb{R}^d$. For any $(\epsilon_i)_{i=1}^{d+1} \in \{-1, +1\}^{d+1}$,
let $w = \sum_{i=1}^{d+1} \epsilon_i e_i$ and $b = 0$. The half space $A_{w,b} \cap X$ is such that $\{ e_i : i =
1, \dots, d + 1 \} \cap (A_{w,b} \cap X) = \{ e_i : \epsilon_i = +1 \}$. This proves that $h(S) \ge d + 1$.

To prove that $h(S) \le d + 1$, we have to show that for any set $A \subset \mathbb{R}^d$ of size
$|A| = d + 2$, there is $B \subset A$ such that $B \notin (A \cap S)$. Obviously this will be the case if
the convex hulls of $B$ and $A \setminus B$ have a non-empty intersection: indeed if a hyperplane
separates two sets of points, it also separates their convex hulls. As $|A| > d + 1$,
$A$ is affine dependent: there is $(\lambda_x)_{x \in A} \in \mathbb{R}^{d+2} \setminus \{0\}$ such that $\sum_{x \in A} \lambda_x x = 0$ and
$\sum_{x \in A} \lambda_x = 0$. The set $B = \{ x \in A : \lambda_x > 0 \}$ and its complement $A \setminus B$ are
non-empty, because $\sum_{x \in A} \lambda_x = 0$ and $\lambda \neq 0$. Moreover $\sum_{x \in B} \lambda_x = \sum_{x \in A \setminus B} (-\lambda_x) > 0$.
The relation

\[ \frac{1}{\sum_{x \in B} \lambda_x} \sum_{x \in B} \lambda_x x = \frac{1}{\sum_{x \in B} \lambda_x} \sum_{x \in A \setminus B} (-\lambda_x) x \]

shows that the convex hulls of $B$ and $A \setminus B$ have a non-void intersection. □
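
The case $d = 1$ of this proposition is small enough to be checked by brute force. The following sketch, with hypothetical helper names, enumerates the traces of all half lines $\{ x : w x \ge b \}$ on a finite set of distinct real numbers and looks for the largest shattered subset; the answer should be $d + 1 = 2$.

```python
from itertools import combinations
from typing import FrozenSet, Sequence, Set


def half_line_traces(points: Sequence[float]) -> Set[FrozenSet[float]]:
    """All subsets of `points` (assumed distinct) cut out by a half line of the real line."""
    pts = sorted(points)
    traces = {frozenset(), frozenset(pts)}  # w = 0 with b > 0, and w = 0 with b <= 0
    for t in pts:
        traces.add(frozenset(x for x in pts if x >= t))  # w > 0
        traces.add(frozenset(x for x in pts if x <= t))  # w < 0
    return traces


def is_shattered(points: Sequence[float]) -> bool:
    """True when every subset of `points` is the trace of some half line."""
    return len(half_line_traces(points)) == 2 ** len(points)


def vc_dimension_on(points: Sequence[float]) -> int:
    """Largest size of a subset of `points` shattered by half lines."""
    return max(size for size in range(len(points) + 1)
               if any(is_shattered(subset) for subset in combinations(points, size)))
```

For example `vc_dimension_on([0.0, 1.0, 2.0, 3.5])` evaluates to 2: any pair of points is shattered, while for three points $a < b < c$ the subset $\{a, c\}$ is never the trace of a half line.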
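Let us introduce the function of two integers

\[ \Phi_n^h = \sum_{k=0}^h \binom{n}{k}, \]

which can alternatively be defined by the relations

\[ \Phi_n^h = \begin{cases} 2^n & \text{when } n \le h, \\ \Phi_{n-1}^{h-1} + \Phi_{n-1}^h & \text{when } n > h. \end{cases} \]

Theorem 4.2.2. Whenever $\bigcup S$ is finite,

\[ |S| \le \Phi_{|\bigcup S|}^{h(S)}. \]

Theorem 4.2.3. For any $h \le n$,

\[ \Phi_n^h \le \exp\bigl[ n H( \tfrac{h}{n} ) \bigr] \le \exp\bigl[ h ( \log( \tfrac{n}{h} ) + 1 ) \bigr], \]

where $H(p) = - p \log(p) - (1 - p) \log(1 - p)$ is the Shannon entropy of the Bernoulli
distribution with parameter $p$.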
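
These relations are easy to test numerically. The sketch below (function names are ours) computes $\Phi_n^h$ from its definition, and compares its logarithm with the two upper bounds of Theorem 4.2.3. As an assumption of this sketch, the entropy bound is only checked in the range $1 \le h \le n/2$, which is where the Chernoff argument with a positive $\lambda$ used in the proof gives a nontrivial bound.

```python
import math


def phi(n: int, h: int) -> int:
    """Phi_n^h = sum_{k=0}^{h} binom(n, k); equals 2^n when n <= h."""
    return sum(math.comb(n, k) for k in range(h + 1))


def entropy(p: float) -> float:
    """Shannon entropy (natural logarithm) of the Bernoulli distribution of parameter p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def chain_holds(n: int, h: int, slack: float = 1e-9) -> bool:
    """Check log Phi_n^h <= n H(h/n) <= h (log(n/h) + 1), up to rounding slack."""
    log_phi = math.log(phi(n, h))
    entropy_bound = n * entropy(h / n)
    outer_bound = h * (math.log(n / h) + 1.0)
    return log_phi <= entropy_bound + slack and entropy_bound <= outer_bound + slack


# With the values N = 1000, h = 10 of the numerical example of the previous chapter:
log_phi_example = math.log(phi(1000, 10))
outer_example = 10 * (math.log(1000 / 10) + 1.0)
```

The function `phi` also satisfies the recurrence `phi(n, h) == phi(n - 1, h - 1) + phi(n - 1, h)` for `n > h >= 1`, which is the form used in the proof of Theorem 4.2.2.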
PROOF of theorem 4.2.2. Let us prove this theorem by induction on $|\bigcup S|$. It
is easy to check that it holds true when $|\bigcup S| = 1$. Let $X = \bigcup S$, let $x \in X$ and
$X' = X \setminus \{x\}$. Define ($\Delta$ denoting the symmetric difference of two sets)

\[ S' = \{ A \in S : A \Delta \{x\} \in S \}, \]
\[ S'' = \{ A \in S : A \Delta \{x\} \notin S \}. \]

Clearly, $\sqcup$ denoting the disjoint union, $S = S' \sqcup S''$ and $S \cap X' = (S' \cap X') \sqcup (S'' \cap X')$.
Moreover $|S'| = 2 |S' \cap X'|$ and $|S''| = |S'' \cap X'|$. Thus

\[ |S| = |S'| + |S''| = 2 |S' \cap X'| + |S'' \cap X'| = |S \cap X'| + |S' \cap X'|. \]

Obviously $h(S \cap X') \le h(S)$. Moreover $h(S' \cap X') = h(S') - 1$, because if $A \subset X'$ is
shattered by $S'$ (or equivalently by $S' \cap X'$), then $A \cup \{x\}$ is shattered by $S'$ (we say
that $A$ is shattered by $S$ when $A \cap S = \{0, 1\}^A$). Using the induction hypothesis, we
then see that $|S| \le \Phi_{|X'|}^{h(S)} + \Phi_{|X'|}^{h(S) - 1}$. But as $|X'| = |X| - 1$, the right-hand side
of this inequality is equal to $\Phi_{|X|}^{h(S)}$ according to the recurrence equation satisfied
by $\Phi$. □

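PROOF of theorem 4.2.3. This is the well-known Chernoff bound for the
deviation of sums of Bernoulli random variables: let $(\sigma_1, \dots, \sigma_n)$ be i.i.d. Bernoulli
random variables with parameter 1/2. Let us notice that

\[ \Phi_n^h = 2^n \, \mathbb{P} \Bigl( \sum_{i=1}^n \sigma_i \le h \Bigr). \]

For any positive real number $\lambda$,

\[ \mathbb{P} \Bigl( \sum_{i=1}^n \sigma_i \le h \Bigr) \le \exp(\lambda h) \, \mathbb{E} \Bigl[ \exp\Bigl( - \lambda \sum_{i=1}^n \sigma_i \Bigr) \Bigr] = \exp\Bigl\{ \lambda h + n \log\bigl\{ \mathbb{E}\bigl[ \exp( - \lambda \sigma_1 ) \bigr] \bigr\} \Bigr\}. \]

Differentiating the right-hand side in $\lambda$ shows that its minimal value is
$\exp[ - n \mathcal{K}( \frac{h}{n}, \frac{1}{2} ) ]$, where $\mathcal{K}(p, q) = p \log( \frac{p}{q} ) + (1 - p) \log( \frac{1 - p}{1 - q} )$ is the Kullback
divergence function between two Bernoulli distributions $B_p$ and $B_q$ of parameters $p$ and
$q$. Indeed the optimal value $\lambda^*$ of $\lambda$ is such that

\[ \frac{h}{n} = \frac{\mathbb{E}\bigl[ \sigma_1 \exp( - \lambda^* \sigma_1 ) \bigr]}{\mathbb{E}\bigl[ \exp( - \lambda^* \sigma_1 ) \bigr]}. \]

Therefore, using the fact that two Bernoulli distributions with the same
expectations are equal,

\[ \log\bigl\{ \mathbb{E}\bigl[ \exp( - \lambda^* \sigma_1 ) \bigr] \bigr\} = - \lambda^* B_{h/n}(\sigma_1) - \mathcal{K}( B_{h/n}, B_{1/2} ) = - \lambda^* \tfrac{h}{n} - \mathcal{K}( \tfrac{h}{n}, \tfrac{1}{2} ). \]

The announced result then follows from the identity

\[ H(p) = \log(2) - \mathcal{K}( p, \tfrac{1}{2} ) = p \log( p^{-1} ) + (1 - p) \log\Bigl( 1 + \frac{p}{1 - p} \Bigr) \le p \bigl[ \log( p^{-1} ) + 1 \bigr]. \]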
4.2.3. Vapnik-Cervonenkis dimension of linear rules with margin

The proof of the following theorem was suggested to us by a similar proof presented
in Cristianini et al. (2000).

Theorem 4.2.4. Consider a family of points $(x_1, \dots, x_n)$ in some Euclidean
vector space $E$ and a family of affine functions

\[ \mathcal{H} = \{ g_{w,b} : E \to \mathbb{R} ; \ w \in E, \ \|w\| = 1, \ b \in \mathbb{R} \}, \]

where

\[ g_{w,b}(x) = \langle w, x \rangle - b, \qquad x \in E. \]

Assume that there is a set of thresholds $(b_i)_{i=1}^n \in \mathbb{R}^n$ such that for any
$(y_i)_{i=1}^n \in \{-1, +1\}^n$, there is $g_{w,b} \in \mathcal{H}$ such that

\[ \inf_{i=1, \dots, n} \bigl( g_{w,b}(x_i) - b_i \bigr) y_i \ge \gamma. \]

Let us also introduce the empirical variance of $(x_i)_{i=1}^n$,

\[ \operatorname{Var}(x_1, \dots, x_n) = \frac{1}{n} \sum_{i=1}^n \Bigl\| x_i - \frac{1}{n} \sum_{j=1}^n x_j \Bigr\|^2. \]

In this case and with this notation,

\[ \frac{\operatorname{Var}(x_1, \dots, x_n)}{\gamma^2} \ge \begin{cases} n - 1 & \text{when } n \text{ is even,} \\ (n - 1) \dfrac{n^2 - 1}{n^2} & \text{when } n \text{ is odd.} \end{cases} \tag{4.5} \]

Moreover, equality is reached when $\gamma$ is optimal, $b_i = 0$, $i = 1, \dots, n$ and $(x_1, \dots,
x_n)$ is a regular simplex (i.e. when $2 \gamma$ is the minimum distance between the convex
hulls of any two subsets of $\{ x_1, \dots, x_n \}$ and $\| x_i - x_j \|$ does not depend on $i \neq j$).

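Proof. Let $(s_i)_{i=1}^n \in \mathbb{R}^n$ be such that $\sum_{i=1}^n s_i = 0$. Let $\sigma$ be a uniformly distributed random variable with values in $\mathfrak{S}_n$, the set of permutations of the first $n$ integers $\{1, \dots, n\}$. By assumption, for any value of $\sigma$, there is an affine function $g_{w,b} \in \mathcal{H}$ such that

$$\min_{i=1,\dots,n} \bigl[g_{w,b}(x_i) - b_i\bigr]\bigl[2\,\mathbb{1}(s_{\sigma(i)} > 0) - 1\bigr] \ge \gamma.$$

As a consequence

$$\Bigl\langle \sum_{i=1}^n s_{\sigma(i)} x_i, w \Bigr\rangle = \sum_{i=1}^n s_{\sigma(i)}\bigl(\langle x_i, w\rangle - b - b_i\bigr) + \sum_{i=1}^n s_{\sigma(i)} b_i \ge \sum_{i=1}^n \gamma\,|s_{\sigma(i)}| + s_{\sigma(i)} b_i.$$

Therefore, using the fact that the map $x \mapsto \bigl(\max\{0, x\}\bigr)^2$ is convex,

$$E\Bigl(\Bigl\|\sum_{i=1}^n s_{\sigma(i)} x_i\Bigr\|^2\Bigr) \ge E\Bigl[\Bigl(\max\Bigl\{0, \sum_{i=1}^n \gamma\,|s_{\sigma(i)}| + s_{\sigma(i)} b_i\Bigr\}\Bigr)^2\Bigr] \ge \Bigl(\max\Bigl\{0, \sum_{i=1}^n \gamma\,E\bigl(|s_{\sigma(i)}|\bigr) + E\bigl(s_{\sigma(i)}\bigr) b_i\Bigr\}\Bigr)^2 = \gamma^2\Bigl(\sum_{i=1}^n |s_i|\Bigr)^2,$$

where $E$ is the expectation with respect to the random permutation $\sigma$. On the other hand

$$E\Bigl(\Bigl\|\sum_{i=1}^n s_{\sigma(i)} x_i\Bigr\|^2\Bigr) = \sum_{i=1}^n E\bigl(s_{\sigma(i)}^2\bigr)\|x_i\|^2 + \sum_{i \ne j} E\bigl(s_{\sigma(i)} s_{\sigma(j)}\bigr)\langle x_i, x_j\rangle.$$

Moreover

$$E\bigl(s_{\sigma(i)}^2\bigr) = \frac{1}{n} E\Bigl(\sum_{i=1}^n s_{\sigma(i)}^2\Bigr) = \frac{1}{n}\sum_{i=1}^n s_i^2.$$

In the same way, for any $i \ne j$,

$$E\bigl(s_{\sigma(i)} s_{\sigma(j)}\bigr) = \frac{1}{n(n-1)} E\Bigl(\sum_{i \ne j} s_{\sigma(i)} s_{\sigma(j)}\Bigr) = \frac{1}{n(n-1)}\sum_{i \ne j} s_i s_j = \frac{1}{n(n-1)}\Bigl[\Bigl(\sum_{i=1}^n s_i\Bigr)^2 - \sum_{i=1}^n s_i^2\Bigr] = -\frac{1}{n(n-1)}\sum_{i=1}^n s_i^2.$$

Thus

$$E\Bigl(\Bigl\|\sum_{i=1}^n s_{\sigma(i)} x_i\Bigr\|^2\Bigr) = \Bigl(\sum_{i=1}^n s_i^2\Bigr)\Bigl[\frac{1}{n}\sum_{i=1}^n \|x_i\|^2 - \frac{1}{n(n-1)}\sum_{i \ne j}\langle x_i, x_j\rangle\Bigr]$$
$$= \Bigl(\sum_{i=1}^n s_i^2\Bigr)\Bigl[\Bigl(\frac{1}{n} + \frac{1}{n(n-1)}\Bigr)\sum_{i=1}^n \|x_i\|^2 - \frac{1}{n(n-1)}\Bigl\|\sum_{i=1}^n x_i\Bigr\|^2\Bigr] = \frac{n}{n-1}\Bigl(\sum_{i=1}^n s_i^2\Bigr)\operatorname{Var}(x_1, \dots, x_n).$$

We have proved that

$$\frac{\operatorname{Var}(x_1, \dots, x_n)}{\gamma^2} \ge \frac{(n-1)\bigl(\sum_{i=1}^n |s_i|\bigr)^2}{n\sum_{i=1}^n s_i^2}.$$

This can be used with $s_i = \mathbb{1}\bigl(i \le \tfrac{n}{2}\bigr) - \mathbb{1}\bigl(i > \tfrac{n}{2}\bigr)$ in the case when $n$ is even and $s_i = \tfrac{2}{n-1}\mathbb{1}\bigl(i \le \tfrac{n-1}{2}\bigr) - \tfrac{2}{n+1}\mathbb{1}\bigl(i > \tfrac{n-1}{2}\bigr)$ in the case when $n$ is odd, to establish the first inequality (4.5) of the theorem.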
Checking that equality is reached for the simplex is an easy computation when the simplex $(x_i)_{i=1}^n \in (\mathbb{R}^n)^n$ is parametrized in such a way that

$$x_i(j) = \begin{cases} 1 & \text{if } i = j,\\ 0 & \text{otherwise.} \end{cases}$$

Indeed the distance between the convex hulls of any two subsets of the simplex is the distance between their mean values (i.e. centers of mass). □
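The equality case can be checked numerically on the canonical basis of $\mathbb{R}^n$. The sketch below is only an illustration under stated assumptions: the simplex is taken to be $x_i = e_i$, the margin $\gamma$ is computed as half the smallest distance between the centers of mass of two disjoint non-empty subsets (whose squared distance is $1/p + 1/q$ for subsets of sizes $p$ and $q$), and the odd case of the bound is assumed to read $(n-1)(1 - 1/n^2)$, in agreement with the margins $\gamma_{2h+1}$ of equation (4.6). All function names are hypothetical.

```python
import math

def simplex(n):
    """Canonical basis of R^n, a regular simplex with n vertices."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def empirical_variance(points):
    """(1/n) * sum_i || x_i - mean ||^2."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    return sum(sum((p[j] - mean[j]) ** 2 for j in range(dim)) for p in points) / n

def simplex_margin(n):
    """Half the smallest distance between centers of mass of two disjoint subsets."""
    smallest = min(
        math.sqrt(1.0 / p + 1.0 / q)
        for p in range(1, n)
        for q in range(1, n - p + 1)
    )
    return smallest / 2.0

def lower_bound(n):
    """Right-hand side of (4.5), as assumed in the lead-in."""
    return float(n - 1) if n % 2 == 0 else (n - 1) * (1.0 - 1.0 / n ** 2)

for n in range(2, 10):
    ratio = empirical_variance(simplex(n)) / simplex_margin(n) ** 2
    print(n, ratio, lower_bound(n))
```

For each $n$ the two printed values should coincide up to rounding, which is the equality case of the theorem.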



4.2.4. Application to Support Vector Machines

We are going to apply Theorem 4.2.4 (page 146) to Support Vector Machines in the transductive case. Let $(X_i, Y_i)_{i=1}^{(k+1)N}$ be distributed according to some partially exchangeable distribution $\mathbb{P}$ and assume that $(X_i)_{i=1}^{(k+1)N}$ and $(Y_i)_{i=1}^{N}$ are observed. Let us consider some positive kernel $K$ on $\mathcal{X}$. For any $K$-separable training set of the form $Z' = (X_i, y'_i)_{i=1}^{(k+1)N}$, where $(y'_i)_{i=1}^{(k+1)N} \in \mathcal{Y}^{(k+1)N}$, let $f_{Z'}$ be the Support Vector Machine defined by $K$ and $Z'$ and let $\gamma(Z')$ be its margin. Let

$$R^2 = \max_{i=1,\dots,(k+1)N} \Bigl[ K(X_i, X_i) + \frac{1}{(k+1)^2N^2}\sum_{j=1}^{(k+1)N}\sum_{\ell=1}^{(k+1)N} K(X_j, X_\ell) - \frac{2}{(k+1)N}\sum_{j=1}^{(k+1)N} K(X_i, X_j) \Bigr].$$

This defines an easily computable upper bound $R$ for the radius of some ball containing the image of $(X_1, \dots, X_{(k+1)N})$ in feature space.
Let us define for any integer $h$ the margins

$$(4.6)\qquad \gamma_{2h} = (2h-1)^{-1/2} \quad\text{and}\quad \gamma_{2h+1} = \Bigl[2h\Bigl(1 - \frac{1}{(2h+1)^2}\Bigr)\Bigr]^{-1/2}.$$
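Both quantities are cheap to evaluate from a Gram matrix. The following sketch assumes that $R^2$ is the largest squared feature-space distance between a pattern and the center of mass of the whole sample, computed through the kernel identity $\|\phi(x_i) - \bar\phi\|^2 = K_{ii} - \frac{2}{n}\sum_j K_{ij} + \frac{1}{n^2}\sum_{j,\ell} K_{j\ell}$, and that $\gamma_1 = +\infty$ (the case $h = 0$ of the odd formula). The names `squared_radius` and `gamma` are illustrative.

```python
import math

def gamma(h: int) -> float:
    """Margins of equation (4.6), indexed directly by h = 1, 2, 3, ..."""
    if h < 1:
        raise ValueError("h must be a positive integer")
    if h == 1:
        return math.inf              # assumption: limit case of the odd formula
    if h % 2 == 0:
        return (h - 1) ** -0.5
    m = (h - 1) // 2                 # h = 2m + 1
    return (2 * m * (1.0 - 1.0 / h ** 2)) ** -0.5

def squared_radius(gram):
    """max_i [ K_ii + mean(K) - 2 * mean_j K_ij ] for a square Gram matrix."""
    n = len(gram)
    mean_all = sum(sum(row) for row in gram) / n ** 2
    return max(gram[i][i] + mean_all - 2.0 * sum(gram[i]) / n for i in range(n))

# Illustrative data: four points of the plane with the linear kernel.
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
gram = [[sum(a * b for a, b in zip(x, y)) for y in points] for x in points]
r2 = squared_radius(gram)
print(r2, [gamma(h) for h in range(1, 6)])
```

With the linear kernel, `r2` is the largest squared Euclidean distance to the centroid of the points, so the kernel formula can be compared with a direct computation.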

Let us consider for any $h = 1, \dots, N$ the exchangeable model

$$\mathcal{R}_h = \bigl\{f_{Z'} : Z' = (X_i, y'_i)_{i=1}^{(k+1)N} \text{ is } K\text{-separable and } \gamma(Z') \ge R\gamma_h\bigr\}.$$

The family of models $\mathcal{R}_h$, $h = 1, \dots, N$ is nested, and we know from Theorem 4.2.4 (page 146) and Theorems 4.2.2 and 4.2.3 that

$$\log(|\mathcal{R}_h|) \le h\log\Bigl(\frac{(k+1)eN}{h}\Bigr).$$

We can then consider on the large model $\mathcal{R} = \bigsqcup_{h=1}^{N}\mathcal{R}_h$ (the disjoint union of the sub-models) an exchangeable prior $\pi$ which is uniform on each $\mathcal{R}_h$ and is such that $\pi(\mathcal{R}_h) \ge \frac{1}{h(h+1)}$. Applying Theorem 3.2.3 we get

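Proposition 4.2.5. With $\mathbb{P}$ probability at least $1 - \epsilon$, for any $h = 1, \dots, N$, any Support Vector Machine $f \in \mathcal{R}_h$,

$$r_2(f) \le \frac{k+1}{k}\inf_{\lambda \in \mathbb{R}_+} \frac{1 - \exp\Bigl[-\dfrac{\lambda}{N} r_1(f) - \dfrac{h}{N}\log\Bigl(\dfrac{e(k+1)N}{h}\Bigr) - \dfrac{\log[h(h+1)] - \log(\epsilon)}{N}\Bigr]}{1 - \exp\bigl(-\frac{\lambda}{N}\bigr)} - \frac{r_1(f)}{k}.$$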
Searching the whole model $\mathcal{R}_h$ to optimize the bound may require more computer resources than are available, but any heuristic can be applied to choose $f$, since the bound is uniform. For instance, a Support Vector Machine $f'$ using a box constraint can be trained from the training set $(X_i, Y_i)_{i=1}^{N}$ and then $(y'_i)_{i=1}^{(k+1)N}$ can be set to $y'_i = \operatorname{sign}(f'(X_i))$, $i = 1, \dots, (k+1)N$.



4.2.5. Inductive margin bounds for Support Vector Machines

In order to establish inductive margin bounds, we will need a different combinatorial lemma. It is due to Alon et al. (1997). We will reproduce their proof with some tiny improvements on the values of constants.

Let us consider the finite case when $\mathcal{X} = \{1, \dots, n\}$, $\mathcal{Y} = \{1, \dots, b\}$ and $b \ge 3$. The question we will study would be meaningless when $b \le 2$. Assume as usual that we are dealing with a prescribed set of classification rules $\mathcal{R} = \{f : \mathcal{X} \to \mathcal{Y}\}$. Let us say that a pair $(A, s)$, where $A \subset \mathcal{X}$ is a non-empty set of shapes and $s : A \to \{2, \dots, b-1\}$ a threshold function, is shattered by the set of functions $F \subset \mathcal{R}$ if for any $(\sigma_x)_{x \in A} \in \{-1,+1\}^A$, there exists some $f \in F$ such that $\min_{x \in A} \sigma_x\bigl[f(x) - s(x)\bigr] \ge 1$.

Definition 4.2.1. Let the fat shattering dimension of $(\mathcal{X}, \mathcal{R})$ be the maximal size $|A|$ of the first component of the pairs which are shattered by $\mathcal{R}$.

Let us say that a subset of classification rules $F \subset \mathcal{Y}^{\mathcal{X}}$ is separated whenever for any pair $(f, g) \in F^2$ such that $f \ne g$, $\|f - g\|_\infty = \max_{x \in \mathcal{X}} |f(x) - g(x)| \ge 2$. Let $\mathfrak{M}(\mathcal{R})$ be the maximum size $|F|$ of separated subsets $F$ of $\mathcal{R}$. Note that if $F$ is a separated subset of $\mathcal{R}$ such that $|F| = \mathfrak{M}(\mathcal{R})$, then it is a 1-net for the $L_\infty$ distance: for any function $f \in \mathcal{R}$ there exists $g \in F$ such that $\|f - g\|_\infty \le 1$ (otherwise $f$ could be added to $F$ to create a larger separated set).
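The net property can be illustrated on a small example. The sketch below builds a separated subset greedily; such a subset is maximal for inclusion, though not necessarily of maximum size $\mathfrak{M}(\mathcal{R})$, and the argument above shows that maximality for inclusion is already enough to obtain a 1-net. The rule set (all functions from three shapes to four labels) and the function names are illustrative assumptions.

```python
from itertools import product

def sup_distance(f, g):
    """Sup norm distance between two rules stored as tuples of labels."""
    return max(abs(a - b) for a, b in zip(f, g))

def greedy_separated_subset(rules):
    """Keep a rule whenever it is at distance >= 2 from every rule already kept."""
    kept = []
    for f in rules:
        if all(sup_distance(f, g) >= 2 for g in kept):
            kept.append(f)
    return kept

n, b = 3, 4                                   # |X| = 3 shapes, Y = {1, ..., 4}
rules = list(product(range(1, b + 1), repeat=n))
net = greedy_separated_subset(rules)

# Every rule is within sup distance 1 of some element of the separated subset.
covered = all(any(sup_distance(f, g) <= 1 for g in net) for f in rules)
print(len(rules), len(net), covered)
```

Since labels are integers, a rule rejected by the greedy pass is at distance at most 1 from some kept rule, which is exactly the 1-net property.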

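Lemma 4.2.6. With the above notation, whenever the fat shattering dimension of $(\mathcal{X}, \mathcal{R})$ is not greater than $h$,

$$\log\bigl[\mathfrak{M}(\mathcal{R})\bigr] < \log\bigl[(b-1)(b-2)n\bigr]\Bigl\{\frac{\log\bigl[\sum_{i=1}^{h}\binom{n}{i}(b-2)^i\bigr]}{\log(2)} + 1\Bigr\} + \log(2)$$
$$\le \log\bigl[(b-1)(b-2)n\bigr]\Bigl\{\Bigl[\log\Bigl(\frac{(b-2)n}{h}\Bigr) + 1\Bigr]\frac{h}{\log(2)} + 1\Bigr\} + \log(2).$$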



Proof. For any set of functions $F \subset \mathcal{Y}^{\mathcal{X}}$, let $t(F)$ be the number of pairs $(A, s)$ shattered by $F$. Let $t(m, n)$ be the minimum of $t(F)$ over all separated sets of functions $F \subset \mathcal{Y}^{\mathcal{X}}$ of size $|F| = m$ ($n$ is here to recall that the shape space $\mathcal{X}$ is made of $n$ shapes). For any $m$ such that $t(m, n) > \sum_{i=1}^{h}\binom{n}{i}(b-2)^i$, it is clear that any separated set of functions of size $|F| \ge m$ shatters at least one pair $(A, s)$ such that $|A| > h$. Indeed, from its definition $t(m, n)$ is clearly a non-decreasing function of $m$, so that $t(|F|, n) > \sum_{i=1}^{h}\binom{n}{i}(b-2)^i$. Moreover there are only $\sum_{i=1}^{h}\binom{n}{i}(b-2)^i$ pairs $(A, s)$ such that $|A| \le h$. As a consequence, whenever the fat shattering dimension of $(\mathcal{X}, \mathcal{R})$ is not greater than $h$ we have $\mathfrak{M}(\mathcal{R}) < m$.
It is clear that for any $n \ge 1$, $t(2, n) = 1$.

Lemma 4.2.7. For any $m \ge 1$, $t\bigl[mn(b-1)(b-2), n\bigr] \ge 2t\bigl[m, n-1\bigr]$, and therefore $t\bigl[2n(n-1)\cdots(n-r+1)(b-1)^r(b-2)^r, n\bigr] \ge 2^r$.
Proof. Let $F = \{f_1, \dots, f_{mn(b-1)(b-2)}\}$ be some separated set of functions of size $mn(b-1)(b-2)$. For any pair $(f_{2i-1}, f_{2i})$, $i = 1, \dots, mn(b-1)(b-2)/2$, there is $x_i \in \mathcal{X}$ such that $|f_{2i-1}(x_i) - f_{2i}(x_i)| \ge 2$. Since $|\mathcal{X}| = n$, there is $x \in \mathcal{X}$ such that $\sum_{i=1}^{mn(b-1)(b-2)/2}\mathbb{1}(x_i = x) \ge m(b-1)(b-2)/2$. Let $I = \{i : x_i = x\}$. Since there are $(b-1)(b-2)/2$ pairs $(y_1, y_2) \in \mathcal{Y}^2$ such that $1 \le y_1 < y_2 - 1 \le b - 1$, there is some pair $(y_1, y_2)$, such that $1 \le y_1 < y_2 \le b$ and such that $\sum_{i \in I}\mathbb{1}\bigl(\{y_1, y_2\} = \{f_{2i-1}(x), f_{2i}(x)\}\bigr) \ge m$. Let $J = \bigl\{i \in I : \{f_{2i-1}(x), f_{2i}(x)\} = \{y_1, y_2\}\bigr\}$. Let

$$F_1 = \{f_{2i-1} : i \in J, f_{2i-1}(x) = y_1\} \cup \{f_{2i} : i \in J, f_{2i}(x) = y_1\},$$
$$F_2 = \{f_{2i-1} : i \in J, f_{2i-1}(x) = y_2\} \cup \{f_{2i} : i \in J, f_{2i}(x) = y_2\}.$$

Obviously $|F_1| = |F_2| = |J| = m$. Moreover the restrictions of the functions of $F_1$ to $\mathcal{X} \setminus \{x\}$ are separated, and it is the same with $F_2$. Thus $F_1$ strongly shatters at least $t(m, n-1)$ pairs $(A, s)$ such that $A \subset \mathcal{X} \setminus \{x\}$ and it is the same with $F_2$. Finally, if the pair $(A, s)$ where $A \subset \mathcal{X} \setminus \{x\}$ is both shattered by $F_1$ and $F_2$, then $F_1 \cup F_2$ shatters also $(A \cup \{x\}, s')$ where $s'(x') = s(x')$ for any $x' \in A$ and $s'(x) = \bigl\lfloor \frac{y_1 + y_2}{2} \bigr\rfloor$. Thus $F_1 \cup F_2$, and therefore $F$, shatters at least $2t(m, n-1)$ pairs $(A, s)$. □

Resuming the proof of Lemma 4.2.6, let us choose for $r$ the smallest integer such that $2^r > \sum_{i=1}^{h}\binom{n}{i}(b-2)^i$, which is no greater than

$$\frac{\log\bigl[\sum_{i=1}^{h}\binom{n}{i}(b-2)^i\bigr]}{\log(2)} + 1.$$

In the case when $1 \le n \le r$,

$$\log\bigl(\mathfrak{M}(\mathcal{R})\bigr) \le |\mathcal{X}|\log(|\mathcal{Y}|) = n\log(b) \le r\log(b) < r\log\bigl[(b-1)(b-2)n\bigr] + \log(2),$$

which proves the lemma. In the remaining case $n > r$,

$$t\bigl[2n^r(b-1)^r(b-2)^r, n\bigr] \ge t\bigl[2n(n-1)\cdots(n-r+1)(b-1)^r(b-2)^r, n\bigr] \ge 2^r > \sum_{i=1}^{h}\binom{n}{i}(b-2)^i.$$

Thus $\mathfrak{M}(\mathcal{R}) < 2\bigl[(b-2)(b-1)n\bigr]^r$ as claimed. □



In order to apply this combinatorial lemma to Support Vector Machines, let us consider now the case of separating hyperplanes in $\mathbb{R}^d$ (the generalization to Support Vector Machines being straightforward). Assume that $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \{-1, +1\}$. For any sample $(X_i)_{i=1}^{(k+1)N}$, let

$$R\bigl(X_1^{(k+1)N}\bigr) = \max\{\|X_i\| : 1 \le i \le (k+1)N\}.$$

Let us consider the set of parameters

$$\Theta = \{(w, b) \in \mathbb{R}^d \times \mathbb{R} : \|w\| = 1\}.$$

For any $(w, b) \in \Theta$, let $g_{w,b}(x) = \langle w, x\rangle - b$. Let $h$ be some fixed integer and let $\gamma = R\bigl(X_1^{(k+1)N}\bigr)\gamma_h$, where $\gamma_h$ is defined by equation (4.6, page 148).
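Let us define $\zeta : \mathbb{R} \to \mathbb{Z}$ by

$$\zeta(r) = \begin{cases} -5 & \text{when } r < -4\gamma,\\ -3 & \text{when } -4\gamma \le r < -2\gamma,\\ -1 & \text{when } -2\gamma \le r < 0,\\ +1 & \text{when } 0 \le r < 2\gamma,\\ +3 & \text{when } 2\gamma \le r < 4\gamma,\\ +5 & \text{when } 4\gamma \le r. \end{cases}$$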
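The role of this six-level quantization can be made concrete with a short sketch. It assumes that the intervals are closed on the left and open on the right, so that $\zeta(0) = +1$ agrees with the classification rule $2\,\mathbb{1}[g_{w,b}(x) \ge 0] - 1$; this boundary convention is an assumption of the sketch. It then checks on a grid that a sign error is the same event as $\zeta(r)y < 0$, and that $\zeta(r)y \le 3$ implies $ry \le 4\gamma$.

```python
def zeta(r: float, gamma: float) -> int:
    """Quantize a real score r into {-5, -3, -1, +1, +3, +5} with step 2 * gamma."""
    if r < -4 * gamma:
        return -5
    if r < -2 * gamma:
        return -3
    if r < 0:
        return -1
    if r < 2 * gamma:
        return 1
    if r < 4 * gamma:
        return 3
    return 5

gamma_value = 0.5                                    # illustrative margin
grid = [k / 8 for k in range(-40, 41)]               # scores in [-5, 5]
checks = []
for r in grid:
    for y in (-1, 1):
        sign_error = (1 if r >= 0 else -1) != y
        same_event = sign_error == (zeta(r, gamma_value) * y < 0)
        implication = (zeta(r, gamma_value) * y > 3) or (r * y <= 4 * gamma_value)
        checks.append(same_event and implication)
print(all(checks))
```

The affine rescaling $(\zeta + 7)/2$ maps the six levels onto $\{1, \dots, 6\}$, which corresponds to the case $b = 6$ of the combinatorial lemma.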





Let $G_{w,b}(x) = \zeta\bigl[g_{w,b}(x)\bigr]$. The fat shattering dimension (as defined in 4.2.1) of

$$\Bigl(X_1^{(k+1)N}, \bigl\{(G_{w,b} + 7)/2 : (w, b) \in \Theta\bigr\}\Bigr)$$

is not greater than $h$ (according to Theorem 4.2.4, page 146), therefore there is some set $\mathcal{F}$ of functions from $X_1^{(k+1)N}$ to $\{-5, -3, -1, +1, +3, +5\}$ such that

$$\log(|\mathcal{F}|) \le \log\bigl[20(k+1)N\bigr]\Bigl\{\frac{h}{\log(2)}\Bigl[\log\Bigl(\frac{4(k+1)N}{h}\Bigr) + 1\Bigr] + 1\Bigr\} + \log(2),$$

and for any $(w, b) \in \Theta$, there is $f_{w,b} \in \mathcal{F}$ such that $\sup\{|f_{w,b}(X_i) - G_{w,b}(X_i)| : i = 1, \dots, (k+1)N\} \le 2$. Moreover, the choice of $f_{w,b}$ may be required to depend on $(X_i)_{i=1}^{(k+1)N}$ in an exchangeable way. Similarly to Theorem 3.2.3 (page 117), it can be proved that for any partially exchangeable probability distribution $\mathbb{P} \in \mathcal{M}_+^1(\Omega)$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any $f_{w,b} \in \mathcal{F}$,



(k+l)N 

- J2 ^[fv,,b{Xi)Yi<l] 



kN 



i=N+l 



exp 



f, [/ „ ((w , 1] _*tMl] 

i=l J J 



1 " 

-^E*[W*)*;<i]. 



i=l 



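Let us remark that

$$\mathbb{1}\bigl\{2\,\mathbb{1}[g_{w,b}(X_i) \ge 0] - 1 \ne Y_i\bigr\} = \mathbb{1}\bigl[G_{w,b}(X_i)Y_i < 0\bigr] \le \mathbb{1}\bigl[f_{w,b}(X_i)Y_i \le 1\bigr]$$

and

$$\mathbb{1}\bigl[f_{w,b}(X_i)Y_i \le 1\bigr] \le \mathbb{1}\bigl[G_{w,b}(X_i)Y_i \le 3\bigr] \le \mathbb{1}\bigl[g_{w,b}(X_i)Y_i \le 4\gamma\bigr].$$

This proves the following theorem.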



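Theorem 4.2.8. Let us consider the sequence $(\gamma_h)_{h \in \mathbb{N}^*}$ defined by equation (4.6, page 148). With $\mathbb{P}$ probability at least $1 - \epsilon$, for any $(w, b) \in \Theta$,

$$\frac{1}{kN}\sum_{i=N+1}^{(k+1)N}\mathbb{1}\bigl\{2\,\mathbb{1}[g_{w,b}(X_i) \ge 0] - 1 \ne Y_i\bigr\}$$
$$\le \frac{k+1}{k}\inf_{\lambda \in \mathbb{R}_+, h \in \mathbb{N}^*}\bigl[1 - \exp\bigl(-\tfrac{\lambda}{N}\bigr)\bigr]^{-1}\Bigl\{1 - \exp\Bigl[-\frac{\lambda}{N^2}\sum_{i=1}^{N}\mathbb{1}\bigl[g_{w,b}(X_i)Y_i \le 4R\gamma_h\bigr] - \frac{\log[20(k+1)N]\bigl\{\frac{h}{\log(2)}\log\bigl(\frac{4e(k+1)N}{h}\bigr) + 1\bigr\} + \log\bigl[\frac{2h(h+1)}{\epsilon}\bigr]}{N}\Bigr]\Bigr\}$$
$$- \frac{1}{kN}\sum_{i=1}^{N}\mathbb{1}\bigl[g_{w,b}(X_i)Y_i \le 4R\gamma_h\bigr].$$

Properly speaking this theorem is not a margin bound, but more precisely a margin quantile bound, since it covers the case where some fraction of the training sample falls within the region defined by the margin parameter $\gamma_h$ which optimizes the bound.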

As a consequence though, we get a true (weaker) margin bound: with $\mathbb{P}$ probability at least $1 - \epsilon$, for any $(w, b) \in \Theta$ such that

$$\gamma = \min_{i=1,\dots,N} g_{w,b}(X_i)Y_i > 0,$$



(fc+l)JV 

— ^ l^pQ)^ <0] 



kN 



i=N+l 



< ^±1 <j 1 - exp 



log[20(fc+l)jv] f 16fl 2 +27 2 I 

jv \ io g (2) 7 ^ l0 §^ — im — ) + L j 



This inequality compares favourably with similar inequalities in Cristianini et al. (2000), which moreover do not extend to the margin quantile case as this one.

Let us also mention that it is easy to circumvent the fact that $R$ is not observed when the test set $X_{N+1}^{(k+1)N}$ is not observed.

Indeed, we can consider the sample obtained by projecting $X_1^{(k+1)N}$ on some ball of fixed radius $R_{\max}$, putting

$$t_{R_{\max}}(X_i) = \min\Bigl\{1, \frac{R_{\max}}{\|X_i\|}\Bigr\} X_i.$$
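The projection leaves patterns inside the ball unchanged and rescales the others onto its boundary. A minimal sketch, assuming Euclidean patterns stored as sequences of floats (the function name `project_on_ball` is illustrative):

```python
import math

def project_on_ball(x, r_max: float):
    """Return min{1, r_max / ||x||} * x, the projection of x on the ball of radius r_max."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= r_max:                # also covers the case x = 0
        return list(x)
    scale = r_max / norm
    return [scale * v for v in x]

inside = project_on_ball([0.3, 0.4], 1.0)    # norm 0.5, left unchanged
outside = project_on_ball([3.0, 4.0], 1.0)   # norm 5, rescaled to norm 1
print(inside, outside)
```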



We can further consider an atomic prior distribution $\nu \in \mathcal{M}_+^1(\mathbb{R}_+)$ bearing on $R_{\max}$, to obtain a uniform result through a union bound. As a consequence of the previous theorem, we have



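Corollary 4.2.9. For any atomic prior $\nu \in \mathcal{M}_+^1(\mathbb{R}_+)$, for any partially exchangeable probability measure $\mathbb{P} \in \mathcal{M}_+^1(\Omega)$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any $(w, b) \in \Theta$, any $R_{\max} \in \mathbb{R}_+$,

$$\frac{1}{kN}\sum_{i=N+1}^{(k+1)N}\mathbb{1}\bigl\{2\,\mathbb{1}[g_{w,b} \circ t_{R_{\max}}(X_i) \ge 0] - 1 \ne Y_i\bigr\}$$
$$\le \frac{k+1}{k}\inf_{\lambda \in \mathbb{R}_+, h \in \mathbb{N}^*}\bigl[1 - \exp\bigl(-\tfrac{\lambda}{N}\bigr)\bigr]^{-1}\Bigl\{1 - \exp\Bigl[-\frac{\lambda}{N^2}\sum_{i=1}^{N}\mathbb{1}\bigl[g_{w,b} \circ t_{R_{\max}}(X_i)Y_i \le 4R_{\max}\gamma_h\bigr] - \frac{\log[20(k+1)N]\bigl\{\frac{h}{\log(2)}\log\bigl(\frac{4e(k+1)N}{h}\bigr) + 1\bigr\} + \log\bigl[\frac{2h(h+1)}{\epsilon\,\nu(R_{\max})}\bigr]}{N}\Bigr]\Bigr\}$$
$$- \frac{1}{kN}\sum_{i=1}^{N}\mathbb{1}\bigl[g_{w,b} \circ t_{R_{\max}}(X_i)Y_i \le 4R_{\max}\gamma_h\bigr].$$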



Let us remark that $t_{R_{\max}}(X_i) = X_i$, $i = N+1, \dots, (k+1)N$, as soon as we consider only the values of $R_{\max}$ not smaller than $\max_{i=N+1,\dots,(k+1)N}\|X_i\|$ in this corollary. Thus we obtain a bound on the transductive generalization error of the unthresholded classification rule $2\,\mathbb{1}[g_{w,b}(X_i) \ge 0] - 1$, as well as some incitation to replace it with a thresholded rule when the value of $R_{\max}$ minimizing the bound falls below $\max_{i=N+1,\dots,(k+1)N}\|X_i\|$.
Appendix: Classification by thresholding



In this appendix, we show how the bounds given in the first section of this mono- 
graph can be computed in practice on a simple example: the case when the clas- 
sification is performed by comparing a series of measurements to threshold values. 
Let us mention that our description covers the case when the same measurement is 
compared to several thresholds, since it is enough to repeat a measurement in the 
list of measurements describing a pattern to cover this case. 



5.1. Description of the model 

Let us assume that the patterns we want to classify are described through $h$ real valued measurements normalized in the range $(0,1)$. In this setting the pattern space can thus be defined as $\mathcal{X} = (0,1)^h$.

Consider the threshold set $\mathcal{T} = (0,1)^h$ and the response set $\mathcal{R} = \mathcal{Y}^{\{0,1\}^h}$. For any $t \in (0,1)^h$ and any $a : \{0,1\}^h \to \mathcal{Y}$, let

$$f_{(t,a)}(x) = a\bigl\{\bigl[\mathbb{1}(x^j \ge t_j)\bigr]_{j=1}^{h}\bigr\}, \qquad x \in \mathcal{X},$$

where $x^j$ is the $j$th coordinate of $x \in \mathcal{X}$. Thus our parameter set here is $\Theta = \mathcal{T} \times \mathcal{R}$. Let us consider the Lebesgue measure $L$ on $\mathcal{T}$ and the uniform probability distribution $U$ on $\mathcal{R}$. Let our prior distribution be $\pi = L \otimes U$. Let us define for any threshold sequence $t \in \mathcal{T}$

$$\Delta_t = \bigl\{t' \in \mathcal{T} : (t'_j, t_j) \cap \{X_i^j ; i = 1, \dots, N\} = \emptyset,\ j = 1, \dots, h\bigr\},$$

where $X_i^j$ is the $j$th coordinate of the sample pattern $X_i$ and where the interval $(t'_j, t_j)$ of the real line is defined as the convex hull of the two point set $\{t'_j, t_j\}$, whether $t'_j \le t_j$ or not. We see that $\Delta_t$ is the set of thresholds giving the same response as $t$ on the training patterns. Let us consider for any $t \in \mathcal{T}$ the middle

$$m(\Delta_t) = \frac{\int_{\Delta_t} t'\, L(dt')}{L(\Delta_t)}$$

of $\Delta_t$. The set $\Delta_t$ being a product of intervals, its middle is the point whose coordinates are the middle of these intervals. Let us introduce the finite set $T$ composed of the middles of the cells $\Delta_t$, which can be defined as

$$T = \{t \in \mathcal{T} : t = m(\Delta_t)\}.$$

It is easy to see that $|T| \le (N+1)^h$ and that $|\mathcal{R}| = |\mathcal{Y}|^{2^h}$.
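The cell of a threshold vector and its middle are straightforward to compute coordinate by coordinate. The sketch below is an illustration under the assumption that the response of measurement $j$ is $\mathbb{1}(x^j \ge t_j)$, so that along each coordinate the cell is the interval between the largest sample value below $t_j$ and the smallest sample value not below $t_j$ (boundary points have Lebesgue measure zero and are ignored). The helper names and the toy sample are hypothetical.

```python
import math

def response(t, x):
    """Binary code [1(x^j >= t_j)]_j of pattern x under thresholds t."""
    return tuple(int(xj >= tj) for xj, tj in zip(x, t))

def cell(t, sample):
    """Intervals (lo_j, hi_j) whose product is the cell of t for the given sample."""
    intervals = []
    for j, tj in enumerate(t):
        lo = max([0.0] + [x[j] for x in sample if x[j] < tj])
        hi = min([1.0] + [x[j] for x in sample if x[j] >= tj])
        intervals.append((lo, hi))
    return intervals

def middle(intervals):
    """Middle of the cell: the vector of interval midpoints."""
    return tuple((lo + hi) / 2.0 for lo, hi in intervals)

def volume(intervals):
    """Lebesgue measure of the cell."""
    return math.prod(hi - lo for lo, hi in intervals)

def classify(t, a, x):
    """Rule f_(t,a): look up the label attached to the binary code of x."""
    return a[response(t, x)]

sample = [(0.2, 0.7), (0.6, 0.3), (0.9, 0.5)]   # toy training patterns, h = 2
t = (0.5, 0.4)
intervals = cell(t, sample)
m = middle(intervals)
print(intervals, m, volume(intervals))
```

Replacing `t` by `middle(cell(t, sample))` does not change any response on the training patterns, which is why only the finitely many middles need to be enumerated.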
5.2. Computation of inductive bounds

For any parameter $(t, a) \in \mathcal{T} \times \mathcal{R} = \Theta$, let us consider the posterior distribution $\rho_{(t,a)}$ defined by its density

$$\frac{d\rho_{(t,a)}}{d\pi}(t', a') = \frac{\mathbb{1}(t' \in \Delta_t)\,\mathbb{1}(a' = a)}{\pi(\Delta_t \times \{a\})}.$$

In fact we are considering a finite number of posterior distributions, since $\rho_{(t,a)} = \rho_{(m(\Delta_t),a)}$, where $m(\Delta_t) \in T$. Moreover, for any exchangeable sample distribution $\mathbb{P} \in \mathcal{M}_+^1\bigl[(\mathcal{X} \times \mathcal{Y})^{N+1}\bigr]$ and any thresholds $t \in \mathcal{T}$,

$$\mathbb{P}\Bigl[(X_{N+1}^j, t_j) \cap \{X_i^j, i = 1, \dots, N\} = \emptyset\Bigr] \le \frac{2}{N+1}.$$

Thus, for any $(t, a) \in \Theta$,

$$\mathbb{P}\bigl\{\rho_{(t,a)}\bigl[f_\cdot(X_{N+1})\bigr] \ne f_{(t,a)}(X_{N+1})\bigr\} \le \frac{2h}{N+1},$$

showing that the classification produced by $\rho_{(t,a)}$ on new examples is typically non-random; this result is only indicative, since it is concerned with a non-random choice of $(t, a)$.

Let us compute the various quantities needed to apply the results of the first section, focussing our attention on Theorem 2.1.3 (page 54).

First note that $\rho_{(t,a)}(r) = r[(t,a)]$. The entropy term is such that

$$\mathcal{K}(\rho_{(t,a)}, \pi) = -\log\bigl[\pi(\Delta_t \times \{a\})\bigr] = -\log\bigl[L(\Delta_t)\bigr] + 2^h\log(|\mathcal{Y}|).$$

Let us notice accordingly that

$$\min_{(t,a) \in \Theta}\mathcal{K}(\rho_{(t,a)}, \pi) \le h\log(N+1) + 2^h\log(|\mathcal{Y}|).$$

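Let us introduce the counters

$$b_y^t(c) = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\bigl\{Y_i = y \text{ and } \bigl[\mathbb{1}(X_i^j \ge t_j)\bigr]_{j=1}^{h} = c\bigr\}, \qquad t \in T,\ c \in \{0,1\}^h,\ y \in \mathcal{Y},$$

$$b^t(c) = \sum_{y \in \mathcal{Y}} b_y^t(c) = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\bigl\{\bigl[\mathbb{1}(X_i^j \ge t_j)\bigr]_{j=1}^{h} = c\bigr\}, \qquad t \in T,\ c \in \{0,1\}^h.$$

Since

$$r[(t,a)] = \sum_{c \in \{0,1\}^h}\bigl[b^t(c) - b_{a(c)}^t(c)\bigr],$$

the partition function of the Gibbs estimator can be computed as

$$\pi\bigl[\exp(-\lambda r)\bigr] = \sum_{t \in T} L(\Delta_t)\sum_{a \in \mathcal{R}}\frac{1}{|\mathcal{R}|}\exp\Bigl[-\frac{\lambda}{N}\sum_{i=1}^{N}\mathbb{1}\bigl[Y_i \ne f_{(t,a)}(X_i)\bigr]\Bigr]$$
$$= \sum_{t \in T} L(\Delta_t)\prod_{c \in \{0,1\}^h}\Bigl[\frac{1}{|\mathcal{Y}|}\sum_{y \in \mathcal{Y}}\exp\bigl(-\lambda\bigl[b^t(c) - b_y^t(c)\bigr]\bigr)\Bigr].$$

We see that the number of operations needed to compute $\pi[\exp(-\lambda r)]$ is proportional to $|T| \times 2^h \times |\mathcal{Y}| \le (N+1)^h 2^h |\mathcal{Y}|$. An exact computation will therefore be feasible only for small values of $N$ and $h$. For higher values, a Monte Carlo approximation of this sum will have to be performed instead.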
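The factorization over the codes $c$ can be checked against a brute-force sum over all response functions $a$ on a toy sample. The sketch below assumes that $r$ is the empirical error rate, that the response of measurement $j$ is $\mathbb{1}(x^j \ge t_j)$, and that the sample coordinates are distinct; all function names and data are illustrative, not part of the original text.

```python
import math
from itertools import product

def cells(sample):
    """Yield (middle, Lebesgue volume) for every cell cut out by the sample."""
    h = len(sample[0])
    per_coord = []
    for j in range(h):
        cuts = [0.0] + sorted({x[j] for x in sample}) + [1.0]
        per_coord.append([((lo + hi) / 2.0, hi - lo) for lo, hi in zip(cuts, cuts[1:])])
    for combo in product(*per_coord):
        yield tuple(m for m, _ in combo), math.prod(w for _, w in combo)

def code(t, x):
    return tuple(int(xj >= tj) for xj, tj in zip(x, t))

def partition_factorized(sample, labels, label_set, lam):
    """Sum over cells of L(cell) * prod_c mean_y exp(-lam * [b(c) - b_y(c)])."""
    n, h = len(sample), len(sample[0])
    total = 0.0
    for t, vol in cells(sample):
        b = {}                                    # counters b_y^t(c)
        for x, y in zip(sample, labels):
            key = (code(t, x), y)
            b[key] = b.get(key, 0.0) + 1.0 / n
        prod = 1.0
        for c in product((0, 1), repeat=h):
            b_c = sum(b.get((c, y), 0.0) for y in label_set)
            prod *= sum(math.exp(-lam * (b_c - b.get((c, y), 0.0)))
                        for y in label_set) / len(label_set)
        total += vol * prod
    return total

def partition_brute_force(sample, labels, label_set, lam):
    """Same quantity, enumerating every response function a explicitly."""
    n, h = len(sample), len(sample[0])
    codes = list(product((0, 1), repeat=h))
    total = 0.0
    for t, vol in cells(sample):
        acc = 0.0
        for values in product(label_set, repeat=len(codes)):
            a = dict(zip(codes, values))
            errors = sum(a[code(t, x)] != y for x, y in zip(sample, labels))
            acc += math.exp(-lam * errors / n)
        total += vol * acc / len(label_set) ** len(codes)
    return total

sample = [(0.2, 0.7), (0.6, 0.3), (0.9, 0.5)]
labels = [1, -1, 1]
z_fast = partition_factorized(sample, labels, (-1, 1), lam=2.0)
z_slow = partition_brute_force(sample, labels, (-1, 1), lam=2.0)
print(z_fast, z_slow)
```

The brute-force version costs $|\mathcal{Y}|^{2^h}$ evaluations per cell, against $2^h|\mathcal{Y}|$ for the factorized one, which is the saving described above.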

If we want to compute the bound provided by Theorem 2.1.3 (page 54) or by Theorem 2.2.2, we need also to compute, for any fixed parameter $\theta \in \Theta$, quantities of the type

{exp[£m'(-,0)]} 



We need to introduce 

JV 

iV 



i=i 

Similarly to what has been done previously, we obtain 
7r{exp[-Ar + £m'(-,0)]} 

= E£(A t ) J] [ 1 l^exp('-A[6 t (c)-6*( C )]+e^ 1 c) 
ter ce{o,i} h Ll 1 yey ^ 

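We can then compute

$$\pi_{\exp(-\lambda r)}(r) = -\frac{\partial}{\partial\lambda}\log\bigl\{\pi\bigl[\exp(-\lambda r)\bigr]\bigr\},$$

$$\pi_{\exp(-\lambda r)}\bigl\{\exp\bigl[\xi\rho_\theta(m')\bigr]\bigr\} = \frac{\pi\bigl\{\exp\bigl[-\lambda r + \xi m'(\cdot,\theta)\bigr]\bigr\}}{\pi\bigl[\exp(-\lambda r)\bigr]},$$

$$\pi_{\exp(-\lambda r)}\bigl[m'(\cdot,\theta)\bigr] = \frac{\partial}{\partial\xi}\Big|_{\xi=0}\log\pi\bigl\{\exp\bigl[-\lambda r + \xi m'(\cdot,\theta)\bigr]\bigr\}.$$

This is all we need to compute $B(\rho_\theta, \beta, \gamma)$ (and also $B(\pi_{\exp(-\lambda r)}, \beta, \gamma)$) in Theorem 2.1.3 (page 54), using the approximation

$$\log\Bigl\{\pi_{\exp(-\lambda_1 r)}\Bigl[\exp\bigl\{\xi\,\pi_{\exp(-\lambda_2 r)}(m')\bigr\}\Bigr]\Bigr\} \le \log\Bigl\{\pi_{\exp(-\lambda_1 r)}\Bigl[\exp\bigl\{\xi m'(\cdot,\theta)\bigr\}\Bigr]\Bigr\} + \xi\,\pi_{\exp(-\lambda_2 r)}\bigl[m'(\cdot,\theta)\bigr], \qquad \xi \ge 0.$$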
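Let us also explain how to apply the posterior distribution $\rho_{(t,a)}$, in other words our randomized estimated classification rule, to a new pattern $X_{N+1}$:

$$\rho_{(t,a)}\bigl[f_\cdot(X_{N+1}) = y\bigr] = L(\Delta_t)^{-1}\int_{\Delta_t}\mathbb{1}\Bigl[a\bigl\{\bigl[\mathbb{1}(X_{N+1}^j \ge t'_j)\bigr]_{j=1}^{h}\bigr\} = y\Bigr] L(dt')$$
$$= L(\Delta_t)^{-1}\sum_{c \in \{0,1\}^h} L\Bigl(\bigl\{t' \in \Delta_t : \bigl[\mathbb{1}(X_{N+1}^j \ge t'_j)\bigr]_{j=1}^{h} = c\bigr\}\Bigr)\mathbb{1}\bigl[a(c) = y\bigr].$$

Let us define for short

$$\Delta_t(c) = \bigl\{t' \in \Delta_t : \bigl[\mathbb{1}(X_{N+1}^j \ge t'_j)\bigr]_{j=1}^{h} = c\bigr\}, \qquad c \in \{0,1\}^h.$$

With this notation

$$\rho_{(t,a)}\bigl[f_\cdot(X_{N+1}) = y\bigr] = L(\Delta_t)^{-1}\sum_{c \in \{0,1\}^h} L\bigl[\Delta_t(c)\bigr]\,\mathbb{1}\bigl[a(c) = y\bigr].$$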
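Since the cell is a product of intervals, the ratios $L[\Delta_t(c)]/L(\Delta_t)$ factorize over the coordinates. The following sketch assumes the response convention $\mathbb{1}(x^j \ge t'_j)$ and represents the cell by its list of intervals $(lo_j, hi_j)$ and the response function $a$ by a dictionary from binary codes to labels; these representations and names are illustrative choices.

```python
from itertools import product

def code_probabilities(intervals, x_new):
    """Map each code c to L[cell(c)] / L(cell) for the new pattern x_new."""
    # Along coordinate j, the code bit is 1 exactly when t'_j <= x_new^j.
    p_one = [min(1.0, max(0.0, (v - lo) / (hi - lo)))
             for (lo, hi), v in zip(intervals, x_new)]
    probs = {}
    for c in product((0, 1), repeat=len(intervals)):
        weight = 1.0
        for bit, p in zip(c, p_one):
            weight *= p if bit else 1.0 - p
        probs[c] = weight
    return probs

def label_probabilities(intervals, a, x_new):
    """Posterior probability of each label for x_new under the uniform law on the cell."""
    out = {}
    for c, weight in code_probabilities(intervals, x_new).items():
        out[a[c]] = out.get(a[c], 0.0) + weight
    return out

intervals = [(0.2, 0.6), (0.3, 0.5)]               # a cell, given by its intervals
a = {(0, 0): -1, (0, 1): -1, (1, 0): -1, (1, 1): 1}
probabilities = label_probabilities(intervals, a, (0.5, 0.9))
print(probabilities)
```

In this toy example the new pattern falls inside the first interval of the cell, so the predicted label is genuinely random; when the new pattern lies outside every interval the prediction is deterministic.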

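We can compute in the same way the probabilities for the label of the new pattern under the Gibbs posterior distribution:

$$\pi_{\exp(-\lambda r)}\bigl[f_\cdot(X_{N+1}) = y'\bigr] = \Bigl\{\sum_{t \in T}\prod_{c \in \{0,1\}^h}\Bigl[\frac{1}{|\mathcal{Y}|}\sum_{y \in \mathcal{Y}}\exp\bigl(-\lambda\bigl[b^t(c) - b_y^t(c)\bigr]\bigr)\Bigr]$$
$$\times \sum_{c \in \{0,1\}^h} L\bigl[\Delta_t(c)\bigr]\frac{\sum_{y \in \mathcal{Y}}\mathbb{1}(y = y')\exp\bigl\{-\lambda\bigl[b^t(c) - b_y^t(c)\bigr]\bigr\}}{\sum_{y \in \mathcal{Y}}\exp\bigl\{-\lambda\bigl[b^t(c) - b_y^t(c)\bigr]\bigr\}}\Bigr\}$$
$$\times \Bigl\{\sum_{t \in T} L(\Delta_t)\prod_{c \in \{0,1\}^h}\Bigl[\frac{1}{|\mathcal{Y}|}\sum_{y \in \mathcal{Y}}\exp\bigl(-\lambda\bigl[b^t(c) - b_y^t(c)\bigr]\bigr)\Bigr]\Bigr\}^{-1}.$$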

5.3. Transductive bounds 

In the case when we observe the patterns of a shadow sample $(X_i)_{i=N+1}^{(k+1)N}$ on top of the training sample $(X_i, Y_i)_{i=1}^{N}$, we can introduce the set of thresholds responding as $t$ on the extended sample $(X_i)_{i=1}^{(k+1)N}$

$$\overline{\Delta}_t = \bigl\{t' \in \mathcal{T} : (t'_j, t_j) \cap \{X_i^j ; i = 1, \dots, (k+1)N\} = \emptyset,\ j = 1, \dots, h\bigr\},$$

consider the set

$$\overline{T} = \{t \in \mathcal{T} : t = m(\overline{\Delta}_t)\},$$

of the middle points of the cells $\overline{\Delta}_t$, $t \in \mathcal{T}$, and replace the Lebesgue measure $L \in \mathcal{M}_+^1\bigl[(0,1)^h\bigr]$ of the previous section with the uniform probability measure $\overline{L}$ on $\overline{T}$. We can then consider $\pi = \overline{L} \otimes U$, where $U$ is as previously the uniform probability measure on $\mathcal{R}$. This gives obviously an exchangeable posterior distribution and therefore qualifies it for transductive bounds. Let us notice that $|\overline{T}| \le \bigl[(k+1)N + 1\bigr]^h$, and therefore that $\pi(t, a) \ge \bigl[(k+1)N + 1\bigr]^{-h}|\mathcal{Y}|^{-2^h}$, for any $(t, a) \in \overline{T} \times \mathcal{R}$.

For any $(t, a) \in \mathcal{T} \times \mathcal{R}$ we may similarly to the inductive case consider the posterior distribution $\rho_{(t,a)}$ defined by

$$\frac{d\rho_{(t,a)}}{d\pi}(t', a') = \frac{\mathbb{1}(t' \in \Delta_t)\,\mathbb{1}(a' = a)}{\pi(\Delta_t \times \{a\})},$$

but we may also consider $\delta_{(m(\overline{\Delta}_t),a)}$, which is such that $r_i\{[m(\overline{\Delta}_t), a]\} = r_i[(t,a)]$, $i = 1, 2$, whereas only $\rho_{(t,a)}(r_1) = r_1[(t,a)]$, while

$$\rho_{(t,a)}(r_2) = \frac{1}{|\overline{T} \cap \Delta_t|}\sum_{t' \in \overline{T} \cap \Delta_t} r_2[(t', a)].$$

We get

$$\mathcal{K}(\rho_{(t,a)}, \pi) = -\log\bigl[\overline{L}(\Delta_t)\bigr] + 2^h\log(|\mathcal{Y}|) \le \log(|\overline{T}|) + 2^h\log(|\mathcal{Y}|) = \mathcal{K}\bigl(\delta_{[m(\overline{\Delta}_t),a]}, \pi\bigr) \le h\log\bigl[(k+1)N + 1\bigr] + 2^h\log(|\mathcal{Y}|),$$

whereas we had no such uniform bound in the inductive case. Similarly to the inductive case

$$\pi\bigl[\exp(-\lambda r_1)\bigr] = \sum_{t \in T}\overline{L}(\Delta_t)\prod_{c \in \{0,1\}^h}\Bigl[\frac{1}{|\mathcal{Y}|}\sum_{y \in \mathcal{Y}}\exp\bigl(-\lambda\bigl[b^t(c) - b_y^t(c)\bigr]\bigr)\Bigr].$$

Moreover, for any 6 € 0, 

7r{exp[-Ari +£/9e(m')] } = 7r{exp[-Ari + £m'(-,0)] } 

ter ce{o,i} h 

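The bound for the transductive counterpart to Theorems 2.1.3 (page 54) or 2.2.2, obtained as explained page 115, can be computed as in the inductive case, from these two partition functions and the above entropy computation.
Let us mention finally that, using the same notation as in the inductive case,

$$\pi_{\exp(-\lambda r_1)}\bigl[f_\cdot(X_{N+1}) = y'\bigr] = \Bigl\{\sum_{t \in T}\prod_{c \in \{0,1\}^h}\Bigl[\frac{1}{|\mathcal{Y}|}\sum_{y \in \mathcal{Y}}\exp\bigl(-\lambda\bigl[b^t(c) - b_y^t(c)\bigr]\bigr)\Bigr]$$
$$\times \sum_{c \in \{0,1\}^h}\overline{L}\bigl[\Delta_t(c)\bigr]\frac{\sum_{y \in \mathcal{Y}}\mathbb{1}(y = y')\exp\bigl\{-\lambda\bigl[b^t(c) - b_y^t(c)\bigr]\bigr\}}{\sum_{y \in \mathcal{Y}}\exp\bigl\{-\lambda\bigl[b^t(c) - b_y^t(c)\bigr]\bigr\}}\Bigr\}$$
$$\times \Bigl\{\sum_{t \in T}\overline{L}(\Delta_t)\prod_{c \in \{0,1\}^h}\Bigl[\frac{1}{|\mathcal{Y}|}\sum_{y \in \mathcal{Y}}\exp\bigl(-\lambda\bigl[b^t(c) - b_y^t(c)\bigr]\bigr)\Bigr]\Bigr\}^{-1}.$$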

To conclude this appendix on classification by thresholding, note that similar factorized computations are feasible in the important case of classification trees. This can be achieved using some variant of the context tree weighting method discovered by Willems et al. (1995) and successfully used in lossless compression theory. The interested reader can find a description of this algorithm applied to classification trees in Catoni (2004, page 62).